Noah-FetchRewards opened a new issue, #1411: URL: https://github.com/apache/datafusion-comet/issues/1411
### Describe the bug

I am trying to get a PySpark job to run with Apache Comet on an EKS cluster; after 20+ hours I have been unable to do so, for a variety of reasons.

I first tried to follow the benchmark example at https://github.com/apache/datafusion-comet/tree/main/benchmarks: I built the image locally, pushed it to ECR, and used spark-submit to run tpcbench.py (after updating the jar references to point to the latest version). However, the sample data referenced by `--conf spark.kubernetes.executor.volumes.hostPath.tpcdata.options.path=/mnt/bigdata/tpcds/sf100/` is obviously not in the image, and I gave up fairly quickly because I am unsure how to get the data loaded into the container itself.

I then tried to run one of the PySpark examples already installed in the base datafusion-comet image, specifically `local:///opt/spark/examples/src/main/python/pi.py`. However, I just see the message

```
25/02/16 17:42:12 INFO LoggingPodStatusWatcherImpl: Application status for spark-4a588861176e45218dc96d608820b902 (phase: Running)
```

repeat indefinitely, and the job never completes, so I cannot confirm whether PySpark jobs actually work on Apache Comet.
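For what it's worth, since the hostPath volume assumes the TPC-DS data already exists on every EKS worker node, what I would have expected to work instead is pointing the benchmark directly at S3 and dropping the volume mount. This is an untested sketch, not taken from the Comet docs: the bucket name, the script path inside the image, and the IRSA-based credentials provider are all my own assumptions.

```shell
# Untested sketch: read TPC-DS data from S3 instead of a hostPath mount, so
# nothing has to be pre-loaded onto the EKS worker nodes.
# "my-tpcds-bucket" and the tpcbench.py path are placeholders.
spark-submit \
  --master $SPARK_MASTER \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
  ... \
  local:///opt/spark/tpcbench.py \
  --data s3a://my-tpcds-bucket/sf100   # pass the path via whatever flag tpcbench.py actually expects
```

This assumes hadoop-aws and the AWS SDK are on the image's classpath; if they are not, that would be its own problem.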
Using the command:

```shell
spark-submit \
  --master $SPARK_MASTER \
  --deploy-mode cluster \
  --name comet-tpcbench \
  --driver-memory 8G \
  --conf spark.driver.memory=8G \
  --conf spark.executor.instances=1 \
  --conf spark.executor.memory=32G \
  --conf spark.executor.cores=8 \
  --conf spark.cores.max=8 \
  --conf spark.task.cpus=1 \
  --conf spark.executor.memoryOverhead=3G \
  --jars local://$COMET_JAR \
  --conf spark.executor.extraClassPath=$COMET_JAR \
  --conf spark.driver.extraClassPath=$COMET_JAR \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  --conf spark.comet.exec.all.enabled=true \
  --conf spark.comet.cast.allowIncompatible=true \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.exec.shuffle.mode=auto \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.driver.pod.name=tpcbench \
  --conf spark.kubernetes.container.image=$COMET_DOCKER_IMAGE \
  local:///opt/spark/examples/src/main/python/pi.py
```

I also tried the examples at https://datafusion.apache.org/comet/user-guide/kubernetes.html using the spark-operator (after fixing the outdated jar references again), but the job runs for a while without completing or producing logs. I am not particularly experienced with Spark on Kubernetes, but how exactly am I supposed to run a job to completion with Apache Comet at all? I even installed microk8s and tried to follow the examples there, to no avail. I would also appreciate links to useful reference material for getting started with this sort of work.

I'd like to note that I also spent two days trying to get Apache Ballista to work with its existing examples, only to run into a plethora of strange bugs there as well; I was hoping Comet would be a lot easier to be productive with.
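For reference, these are the standard Spark-on-Kubernetes debugging commands I have been using to try to see what is happening (plain kubectl, nothing Comet-specific). One guess on the hang: an executor requesting 32G memory plus 3G overhead may simply never be schedulable on the node group, which would leave the driver waiting in `phase: Running` forever.

```shell
# Standard Spark-on-Kubernetes debugging; pod name matches
# spark.kubernetes.driver.pod.name from the submit command above.
kubectl logs -f tpcbench                       # stream the driver logs
kubectl describe pod tpcbench                  # look for image-pull or scheduling failures
kubectl get pods -l spark-role=executor        # check whether executors were ever created
kubectl get events --sort-by=.lastTimestamp    # recent events, e.g. FailedScheduling for the 35G executor
```

If `kubectl get pods -l spark-role=executor` shows nothing (or pods stuck in Pending), the problem is resource scheduling rather than Comet itself.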
### Steps to reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org