Re: [I] Parquet deserialization speeds slower on Linux [arrow]

via GitHub Fri, 24 Nov 2023 06:27:27 -0800


mrocklin commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1825750536


   > You can keep those columns dictionary-encoded in Arrow by passing 
read_dictionary=["l_returnflag", "l_linestatus", "l_shipinstruct", 
"l_shipmode"]. This seems to save around 25% CPU time (and also makes the data 
much more compact in memory).
   
   I'm guessing that converting these to pandas dataframes would result in them 
being categorical dtype series.  Is that correct?
   
   > The files don't seem to be at fault (except perhaps for using Snappy :-)).
   
   What would folks recommend as default compression?  LZ4?
   
   If so, @milesgranger maybe it's easy to update the data generation scripts 
to use it?  I'd be fine changing the benchmark if 
we think it's a good global recommendation.  (For context, I don't like 
changing benchmarks just to make performance better, because that leads to 
over-tuning and unrealistic results, but if the change is good general 
practice recommended by other people then I feel better about it.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
