seddonm1 commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-744076299


   Hi @andygrove 
   So I looked at this over the weekend. I was thinking that we could just embed the expected TPC-H answers (given the deterministic inputs we have chosen) and store them as Parquet files. This is similar to what the Databricks `tpch-dbgen` repository does: https://github.com/databricks/tpch-dbgen/tree/master/answers.
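   
   To make that concrete, a comparison could look something like the sketch below. This is only illustrative: it is written against a recent DataFusion API (`SessionContext`, `register_parquet`, `pretty_format_batches`), and the table path, answer file path, and the q6 SQL are placeholders I made up, not anything that exists in this PR.
   
   ```rust
   use datafusion::arrow::util::pretty::pretty_format_batches;
   use datafusion::error::Result;
   use datafusion::prelude::{ParquetReadOptions, SessionContext};
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
       // Hypothetical paths: the generated TPC-H table and the stored answer file.
       ctx.register_parquet("lineitem", "parquet/lineitem", ParquetReadOptions::default())
           .await?;
   
       // Run the query under test (TPC-H q6 shown as an example).
       let actual = ctx
           .sql(
               "SELECT sum(l_extendedprice * l_discount) AS revenue \
                FROM lineitem \
                WHERE l_shipdate >= date '1994-01-01' \
                  AND l_shipdate < date '1995-01-01' \
                  AND l_discount BETWEEN 0.05 AND 0.07 \
                  AND l_quantity < 24",
           )
           .await?
           .collect()
           .await?;
   
       // Load the embedded expected answer.
       let expected = ctx
           .read_parquet("answers/q6.parquet", ParquetReadOptions::default())
           .await?
           .collect()
           .await?;
   
       // Compare the pretty-printed batches; real validation would probably
       // allow a small numeric tolerance on the decimal columns rather than
       // exact string equality.
       assert_eq!(
           pretty_format_batches(&expected)?.to_string(),
           pretty_format_batches(&actual)?.to_string(),
       );
       Ok(())
   }
   ```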
   
   I have produced the answers with Spark as single-partition Parquet files (Snappy compression), and that would require somewhere around 3.5 MB in total. To reduce the data requirements we could also do a `LIMIT n` on each answer (which is what it looks like Databricks have done) and just check that those rows are contained in the result set.
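   
   That containment check could be expressed as a set difference, e.g. as in the sketch below. Again this is just an illustration with made-up paths and a q1-ish placeholder query; it uses `DataFrame::except` and `DataFrame::count` from a recent DataFusion API, and `EXCEPT` semantics mean duplicate rows are not distinguished, so it checks set containment only.
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::{ParquetReadOptions, SessionContext};
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
       // Hypothetical paths: the generated table and the truncated (LIMIT n)
       // answers, which must have the same schema as the query output.
       ctx.register_parquet("lineitem", "parquet/lineitem", ParquetReadOptions::default())
           .await?;
       let answers = ctx
           .read_parquet("answers/q1_limit100.parquet", ParquetReadOptions::default())
           .await?;
   
       // Run the query under test (a simplified q1-style aggregate).
       let actual = ctx
           .sql(
               "SELECT l_returnflag, l_linestatus, count(*) AS cnt \
                FROM lineitem \
                GROUP BY l_returnflag, l_linestatus \
                ORDER BY l_returnflag, l_linestatus",
           )
           .await?;
   
       // Set difference: answer rows that do NOT appear in the actual result.
       // Zero missing rows means the stored answers are contained in the result.
       let missing = answers.except(actual)?.count().await?;
       assert_eq!(missing, 0, "stored answer rows missing from the query result");
       Ok(())
   }
   ```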
   
   I have also had trouble generating the Parquet files with your program. Here is my alternative way to generate the test dataset; it needs to be run from within the `tpch-dbgen` directory (or you can change the volume mount):
   
   ```bash
   docker run \
   --rm \
   --volume $(pwd):/tpch:Z \
   --env "ETL_CONF_ENV=production" \
   --env "CONF_NUM_PARITIONS=10" \
   --env "INPUT_PATH=/tpch/tbl" \
   --env "OUTPUT_PATH=/tpch/parquet" \
   --env "SCHEMA_PATH=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/schema" \
   --entrypoint="" \
   --publish 4040:4040 \
   ghcr.io/tripl-ai/arc:arc_3.6.2_spark_3.0.1_scala_2.12_hadoop_3.2.0_1.10.0 \
   bin/spark-submit \
   --master "local[*]" \
   --driver-memory 4G \
   --driver-java-options "-XX:+UseG1GC" \
   --class ai.tripl.arc.ARC \
   /opt/spark/jars/arc.jar \
   --etl.config.uri=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/tpch.ipynb
   ```
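   
   If it runs successfully, the converted Parquet files should land in `./parquet` (the `OUTPUT_PATH` above, via the volume mount) next to the `tbl` directory, and the Spark UI is exposed on port 4040 while the job is running.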

