mrocklin commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1825750536
> You can keep those columns dictionary-encoded in Arrow by passing read_dictionary=["l_returnflag", "l_linestatus", "l_shipinstruct", "l_shipmode"]. This seems to save around 25% CPU time (and also makes the data much more compact in memory).

I'm guessing that converting these to pandas dataframes would result in them being categorical dtype series. Is that correct?

> The files don't seem to be at fault (except perhaps for using Snappy :-)).

What would folks recommend as default compression? LZ4? If so, @milesgranger maybe it's easy to change the data generation scripts in some way with this change?

I'd be fine changing things in the benchmark if we think it's a good global recommendation. (For context, I don't like changing things in benchmarks to make performance better because it results in over-tuning and non-realistic results, but if the change is good general practice as recommended by other people then it feels better I think).
