jordepic commented on PR #4487:
URL: 
https://github.com/apache/datafusion-comet/pull/4487#issuecomment-4745725228

    
     ┌───────────────────────┬────────────┬──────────────┬────────────────────┬───────────┐
     │                       │ A baseline │ B comet read │ C comet read+write │ total A→C │
     ├───────────────────────┼────────────┼──────────────┼────────────────────┼───────────┤
     │ 100 columns (2M rows) │ 16.53s     │ 13.83s       │ 4.57s              │ 3.6×      │
     ├───────────────────────┼────────────┼──────────────┼────────────────────┼───────────┤
     │ 1 column (86.5M rows) │ 7.55s      │ 7.74s        │ 3.16s              │ 2.4×      │
     └───────────────────────┴────────────┴──────────────┴────────────────────┴───────────┘
   
   Like every great benchmark, this was taken on my Mac.
   
   Methodology:
   - Write a 2 million row, 100-column Parquet file (about 1.5 GB) and use it
     for all of the 100-column runs
   - Read it in and write it out to Iceberg
   - The file can be read with or without Comet, and written with or without
     Comet
   - As a control, read with Comet but write with iceberg-java (column B)
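
A minimal sketch of how the variants above could be timed side by side. `time_variants` and the placeholder jobs are hypothetical and not part of the PR; each placeholder would be replaced by the real Spark read plus Iceberg write, configured with or without Comet. Keeping the best of several repeats is an assumption, since the comment does not say how many runs were taken.

```python
import time
from typing import Callable, Dict


def time_variants(
    variants: Dict[str, Callable[[], None]], repeats: int = 3
) -> Dict[str, float]:
    """Run each job `repeats` times and keep its fastest wall-clock time."""
    results: Dict[str, float] = {}
    for name, job in variants.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            job()  # hypothetical: read the Parquet file, write to Iceberg
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


def placeholder_job() -> None:
    """Stand-in for a real read+write pipeline (assumption, does no I/O)."""
    time.sleep(0.01)


if __name__ == "__main__":
    # Labels mirror the table columns; the jobs themselves are placeholders.
    timings = time_variants(
        {
            "A baseline": placeholder_job,
            "B comet read": placeholder_job,
            "C comet read+write": placeholder_job,
        }
    )
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.2f}s")
```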
   
   You can see that with 100 columns the full native pipeline (C) is 3.6x as
   fast as the all-JVM baseline (A), and that most of that gain comes from
   enabling native writes. However, I suspected that much of it could be due
   to no longer having to pivot data. To check, I ran the same test on a
   1-column Parquet file. Even there, most of the improvement comes from
   *enabling* native writes, although the overall gap between all-JVM and
   all-native is smaller.
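
To make the attribution concrete, here is a small sketch that splits the total speedup into a read stage and a write stage, using only the timings from the table. The function and key names are illustrative, not anything from the PR.

```python
# Wall-clock times in seconds, copied from the table above.
# A = baseline, B = Comet read only, C = Comet read + write.
timings = {
    "100 columns (2M rows)": {"A": 16.53, "B": 13.83, "C": 4.57},
    "1 column (86.5M rows)": {"A": 7.55, "B": 7.74, "C": 3.16},
}


def speedup(before: float, after: float) -> float:
    """How many times faster `after` is than `before`."""
    return before / after


def stage_speedups(t: dict) -> dict:
    """Break the total A->C speedup into its read and write stages."""
    return {
        "A->B (native read)": speedup(t["A"], t["B"]),
        "B->C (native write)": speedup(t["B"], t["C"]),
        "A->C (total)": speedup(t["A"], t["C"]),
    }


if __name__ == "__main__":
    for name, t in timings.items():
        parts = ", ".join(
            f"{stage}: {value:.2f}x" for stage, value in stage_speedups(t).items()
        )
        print(f"{name}: {parts}")
```

On these numbers, native reads alone account for roughly 1.2x (100 columns) and slightly under 1x (1 column), while switching the write to native accounts for roughly 3.0x and 2.4x respectively.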
   
   In practice, I've seen some phenomenal numbers. I'm working with a
   5k-column dataset that was virtually unwritable before this change. Now I
   can treat it like any other dataset! Datasets with fewer columns may only
   match Spark's performance; I imagine there is some overhead from spinning
   up Comet executors on K8s, but I haven't investigated this much yet. I'll
   attach an image of a query plan too.
   
   <img width="730" height="1726" alt="image (17)" src="https://github.com/user-attachments/assets/2aa0e024-3f4b-4c5c-bb46-a2edc74e436c" />
   
   Here is a plan that goes through many different join operations and keeps 
the whole pipeline columnar, including the write at the end! This can be really 
impactful for ETL jobs and compactions :).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

