jorisvandenbossche commented on pull request #9504: URL: https://github.com/apache/arrow/pull/9504#issuecomment-780876347
Cool! I don't have time right now to give it a more detailed review, but I quickly fetched the branch, and even for a not-yet-optimized first version, this is already *much* faster than the pure-Python pandas `to_csv` writer (in pandas only the CSV reader is optimized, not the writer). With a small example (50,000 rows, 5 columns, with floats/ints/strings), I get 140ms with pandas and 20ms with this branch (in a release build), which even includes the pandas->arrow conversion (but that's only 2-3ms in this case).

A few things I noticed / random thoughts:

* When writing a column of floats that don't have decimals, no decimal point is included in the output, so it looks like an int (e.g. `1` instead of `1.0`; pandas writes the latter). Not sure if we want to preserve this "type information" in this case.
* Pandas doesn't use quoting by default, not even for string columns. I am not fully sure what makes the most sense as the default option, but disabling quoting can be a follow-up enhancement.
* We don't support casting timestamps to strings (yet), so that will be a useful addition to casting, to be used here.
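
To make the first two points concrete, here is a small stdlib-only Python sketch of the two output styles being compared. It does not call Arrow or pandas; `format_cell`, `to_csv_text`, and the options `keep_float_point` and `quote_strings` are hypothetical names made up for illustration, and the "branch-like" style reflects only what is described above (whole floats written without a decimal point, strings quoted).

```python
def format_cell(value, *, keep_float_point, quote_strings):
    """Render one cell as CSV text (illustrative only, not Arrow's or pandas' code)."""
    if isinstance(value, float):
        # A whole-valued float can be written as "1" or "1.0".
        if value.is_integer() and not keep_float_point:
            return str(int(value))
        return repr(value)
    if isinstance(value, str):
        if quote_strings:
            # Standard CSV quoting: wrap in quotes and double any embedded quote.
            return '"' + value.replace('"', '""') + '"'
        # Unquoted output is only safe if the string has no delimiter or newline.
        return value
    return str(value)


def to_csv_text(rows, **options):
    """Join rows into CSV text using the chosen formatting options."""
    return "".join(
        ",".join(format_cell(v, **options) for v in row) + "\n" for row in rows
    )


rows = [(1.0, 2, "a"), (2.5, 3, "b")]

# Style described for this branch: no decimal point on whole floats, strings quoted.
branch_like = to_csv_text(rows, keep_float_point=False, quote_strings=True)

# Style described for pandas: "1.0" kept, strings left unquoted.
pandas_like = to_csv_text(rows, keep_float_point=True, quote_strings=False)

print(branch_like)
print(pandas_like)
```

In the first style, the float column's first value is indistinguishable from an integer once written, which is the "type information" concern raised above.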