camuel commented on issue #18411:
URL: https://github.com/apache/datafusion/issues/18411#issuecomment-3483867131

   Thank you. I see that on your end DataFusion is faster and DuckDB is
   slower, so the gap between them looks completely different from mine.
   
   Somehow, on my end I see almost the same factor on both AMD and a
   MacBook, about 3x. Perhaps it has something to do with the way the
   Parquet files are generated. I tried regenerating SF1000 with the
   gentpch-rs defaults, which produce a single Parquet file per table.
   DuckDB performance didn't change much, but DataFusion performance got
   worse by a factor of 1.36, so the gap got even bigger.
   
   I tried building with the native flag, as Andrew Lamb suggested above,
   but performance improved by less than 10%.
   
   I think that somehow DuckDB is slow on your end, rather than DataFusion
   being slow on my end.
   
   If we focus just on DataFusion performance, it is easy to see that for
   low-cardinality grouping on two dict-encoded columns it spends too much
   time on hashing and on memcmp. These are two single-character columns
   with 4 unique character combinations, so there are many opportunities
   for optimization. I am sure neither DuckDB nor DataFusion exploits all
   of them, but DuckDB somehow exploits more.
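
   To illustrate the kind of shortcut available here, below is a minimal
   sketch in Python. It is not DataFusion or DuckDB code, and every
   function name and the sample data are made up for illustration. It
   assumes both columns arrive as dictionary keys (small integers) and
   that one dictionary per column covers all rows, which may not hold
   across batches in practice. Under those assumptions the two keys can be
   packed into a single array index, so the per-row work is one addition
   into a tiny array with no hashing and no byte comparison.

```python
from collections import defaultdict


def group_count_hashed(col_a, col_b):
    """Baseline: hash the (a, b) value pair for every row."""
    counts = defaultdict(int)
    for a, b in zip(col_a, col_b):
        counts[(a, b)] += 1
    return dict(counts)


def dict_encode(values):
    """Return (dictionary, keys): distinct values in first-seen order and
    one integer key per row. This stands in for the dictionary encoding
    that the file format would already provide."""
    dictionary, index, keys = [], {}, []
    for v in values:
        k = index.get(v)
        if k is None:
            k = len(dictionary)
            index[v] = k
            dictionary.append(v)
        keys.append(k)
    return dictionary, keys


def group_count_direct(dict_a, keys_a, dict_b, keys_b):
    """Group by packing the two dictionary keys into one array slot."""
    width = len(dict_b)
    slots = [0] * (len(dict_a) * width)
    for ka, kb in zip(keys_a, keys_b):
        slots[ka * width + kb] += 1  # direct index, no hash, no memcmp
    # Decode only the non-empty slots back into value pairs.
    return {
        (dict_a[i // width], dict_b[i % width]): n
        for i, n in enumerate(slots)
        if n
    }


# Hypothetical sample: two single-character columns, 4 distinct pairs.
flag = ["A", "N", "N", "R", "N", "A"]
status = ["F", "O", "F", "F", "O", "F"]

dict_a, keys_a = dict_encode(flag)
dict_b, keys_b = dict_encode(status)
result = group_count_direct(dict_a, keys_a, dict_b, keys_b)
```

   With 3 and 2 dictionary entries the slot array has only 6 elements, so
   the whole aggregation state fits in a cache line or two. The values are
   touched only once per group when decoding the result, not once per row.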
   
   
   On Mon, Nov 3, 2025, 10:01 AM Bruce Ritchie ***@***.***>
   wrote:
   
   > *Omega359* left a comment (apache/datafusion#18411)
   > <https://github.com/apache/datafusion/issues/18411#issuecomment-3481826157>
   >
   > This is the samply output I see on my machine:
   > image.png (view on web)
   > <https://github.com/user-attachments/assets/b9ffc9e7-fc43-4d7b-96d8-25d7331cd723>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

