seddonm1 commented on pull request #8760: URL: https://github.com/apache/arrow/pull/8760#issuecomment-744076299
Hi @andygrove So I looked at this over the weekend. I was thinking that we could just embed the expected TPC-H answers (given the deterministic inputs we have chosen) and store them as Parquet. This is similar to how the Databricks TPC-H answers are handled: https://github.com/databricks/tpch-dbgen/tree/master/answers. I have produced results with Spark as single-partition Parquet files (Snappy compressed), and that would require somewhere around 3.5MB. We could also do a `LIMIT n` (which is what it looks like Databricks have done) and just check that the answers are contained in the result set, to reduce the data requirements.

I have also had trouble generating the Parquet files with your program. Here is my alternative way to generate the test dataset, which needs to be run from within the `tpch-dbgen` directory (or you can change the volume mount):

```bash
docker run \
  --rm \
  --volume $(pwd):/tpch:Z \
  --env "ETL_CONF_ENV=production" \
  --env "CONF_NUM_PARITIONS=10" \
  --env "INPUT_PATH=/tpch/tbl" \
  --env "OUTPUT_PATH=/tpch/parquet" \
  --env "SCHEMA_PATH=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/schema" \
  --entrypoint="" \
  --publish 4040:4040 \
  ghcr.io/tripl-ai/arc:arc_3.6.2_spark_3.0.1_scala_2.12_hadoop_3.2.0_1.10.0 \
  bin/spark-submit \
  --master local\[\*\] \
  --driver-memory 4G \
  --driver-java-options "-XX:+UseG1GC" \
  --class ai.tripl.arc.ARC \
  /opt/spark/jars/arc.jar \
  --etl.config.uri=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/tpch.ipynb
```
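To make the `LIMIT n` containment idea concrete, here is a minimal sketch (stdlib-only Python, not code from this PR; the function names and the sample rows are hypothetical) of checking that every stored expected row appears somewhere in a query's result set, with a float tolerance since TPC-H answers contain aggregated decimals:

```python
import math

def rows_match(expected_row, actual_row, rel_tol=1e-6):
    """Compare two result rows, using a relative tolerance for float columns."""
    if len(expected_row) != len(actual_row):
        return False
    for e, a in zip(expected_row, actual_row):
        if isinstance(e, float) and isinstance(a, float):
            if not math.isclose(e, a, rel_tol=rel_tol):
                return False
        elif e != a:
            return False
    return True

def answers_contained(expected_rows, result_rows):
    """Check that every expected row (e.g. a stored LIMIT n answer set)
    is contained in the full query result."""
    return all(
        any(rows_match(e, r) for r in result_rows)
        for e in expected_rows
    )

# Hypothetical TPC-H Q1-style rows: (l_returnflag, l_linestatus, sum_qty)
expected = [("A", "F", 37734107.0), ("N", "F", 991417.0)]
result = [("A", "F", 37734107.0), ("N", "F", 991417.0), ("N", "O", 74476040.0)]
assert answers_contained(expected, result)
assert not answers_contained([("R", "F", 1.0)], result)
```

This only verifies containment, not ordering, so it would need a stricter comparison for queries where the `ORDER BY` clause is part of the expected answer.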
