[GitHub] [arrow] jorisvandenbossche commented on issue #35661: [C++] PyArrow's csv reader yields different results than the default pandas

via GitHub Tue, 23 May 2023 09:07:47 -0700


jorisvandenbossche commented on issue #35661:
URL: https://github.com/apache/arrow/issues/35661#issuecomment-1559745327


   The behaviour you notice does indeed come from first parsing the value as 
a float and only afterwards casting it to string. However, if you use pyarrow's 
csv reader directly and pass the column_types argument, the column is read as a 
string properly:
   
   ```
   >>> import pyarrow as pa
   >>> from pyarrow import csv
   >>> csv.read_csv("bug.csv")
   pyarrow.Table
   user_id: double
   value: int64
   ----
   user_id: [[1225717802.1679842]]
   value: [[33]]
   
   >>> csv.read_csv("bug.csv", convert_options=csv.ConvertOptions(column_types={"user_id": pa.string()}))
   pyarrow.Table
   user_id: string
   value: int64
   ----
   user_id: [["1225717802.1679841607"]]
   value: [[33]]
   ```
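
The difference between the two results can also be illustrated without pyarrow. The following is only a minimal sketch using Python's standard library, not how pyarrow or pandas implement it. The inline CSV text is an assumed stand-in for bug.csv, and the helper names read_then_cast and read_as_string are hypothetical:

```python
import csv
import io

# Assumed stand-in for the contents of "bug.csv" (not taken from the real file).
RAW_CSV = "user_id,value\n1225717802.1679841607,33\n"


def read_then_cast(text):
    """Parse user_id as a float first, then convert the float to a string."""
    rows = csv.DictReader(io.StringIO(text))
    return [str(float(row["user_id"])) for row in rows]


def read_as_string(text):
    """Keep user_id exactly as the text that appears in the file."""
    rows = csv.DictReader(io.StringIO(text))
    return [row["user_id"] for row in rows]


cast_ids = read_then_cast(RAW_CSV)
string_ids = read_as_string(RAW_CSV)

print(cast_ids)    # digits beyond float64 precision are lost
print(string_ids)  # original text, all digits kept
```

A 64-bit float cannot hold all 20 significant digits of this value, so any path that goes through a float before producing a string loses the trailing digits, while a path that never leaves the string type keeps them.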
   
   So I assume this is actually a bug in pandas after all (in how pandas 
integrates with the pyarrow csv reader, and how it translates its own arguments 
to arguments passed to pyarrow). I am therefore closing this issue and will 
re-open the one on the pandas side 
(https://github.com/pandas-dev/pandas/issues/53269).


