[GitHub] [arrow] jorisvandenbossche commented on pull request #9504: ARROW-2229: [C++][Python] Add WriteCsv functionality.

GitBox Wed, 17 Feb 2021 13:48:13 -0800


jorisvandenbossche commented on pull request #9504:
URL: https://github.com/apache/arrow/pull/9504#issuecomment-780876347



   Cool! 
   
   I don't have time right now for a more detailed review, but I quickly 
fetched the branch, and even for a not-yet-optimized first version, this is 
already *much* faster than the pure-Python pandas `to_csv` writer (in pandas only 
the csv reader is optimized, not the writer). 
   With a small example (50,000 rows, 5 columns, with floats/ints/strings), I get 
140ms with pandas, and 20ms with this branch (in a release build), which even 
includes the pandas->arrow conversion (but that's only 2-3ms in this case). 
   
   A few things I noticed / random thoughts:
   
   * When writing a column of floats that don't have decimals, no decimal point 
is included in the output, so it looks like an int (e.g. `1` instead of `1.0`; 
pandas writes the latter). Not sure if we want to preserve this "type 
information" in this case.
   * Pandas doesn't use quoting by default, not even for string columns. I am not 
fully sure what makes the most sense as the default option, but disabling quoting 
can be a follow-up enhancement. 
   * We don't support casting timestamps to strings (yet), so that will be a 
useful addition to the cast functionality, to be used here.
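
To make the first two points concrete, here is a small sketch using only the Python standard library. It is not the Arrow writer from this PR and not pandas; `to_csv_text` and `infer` are hypothetical helpers made up for illustration. It only shows how the quoting mode and the float formatting change what ends up in the file, and what a naive reader would infer from it.

```python
import csv
import io


def to_csv_text(rows, quoting):
    """Serialize rows to a CSV string using the given csv quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def infer(text):
    """Naive type inference: try int first, then fall back to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


rows = [(1.0, 2, "a"), (2.5, 3, "b,c")]

# Quote only when needed (a field containing the delimiter).
minimal = to_csv_text(rows, csv.QUOTE_MINIMAL)
# Quote every non-numeric field, so string columns are always quoted.
nonnumeric = to_csv_text(rows, csv.QUOTE_NONNUMERIC)

# Two ways to format the float 1.0: without and with the decimal point.
short = format(1.0, "g")
full = repr(1.0)

print(minimal)
print(nonnumeric)
print(short, type(infer(short)).__name__)
print(full, type(infer(full)).__name__)
```

With the shorter float form, a reader that tries integers first would bring the value back as an int, which is the loss of "type information" mentioned above.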


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
