[GitHub] [arrow] Meehai opened a new issue, #35661: PyArrow's csv reader yields different results than the default pandas

via GitHub Wed, 17 May 2023 23:49:56 -0700


Meehai opened a new issue, #35661:
URL: https://github.com/apache/arrow/issues/35661


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hello, I've posted this issue on the pandas board as well, and they've asked 
me to put it here too:
   
   ```
   """
   user_id,value
   1225717802.1679841607,33
   """
   
   import pandas as pd
   a = pd.read_csv("bug.csv", dtype={"user_id": str})
   b = pd.read_csv("bug.csv", dtype={"user_id": str}, engine="pyarrow")
   
   print(a.user_id.iloc[0]) # 1225717802.1679841607
   print(b.user_id.iloc[0]) # 1225717802.1679842
   
   assert a.user_id.dtype == b.user_id.dtype # <- both are strings
   assert a.user_id.iloc[0] == b.user_id.iloc[0] # <- this fails
   ```
   
   It seems that under the hood `pyarrow.csv.read_csv` handles columns that are 
explicitly requested as strings differently than expected: some automatic type 
inference happens before the explicit string conversion takes place. In this 
case the value is first interpreted as a float, rounded because of limited 
floating-point precision, and only then converted back to a string.
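
The suspected round trip can be sketched with plain Python floats. This is only an illustration and does not call PyArrow; it assumes the value passes through a 64-bit float before being turned back into a string, which the issue suggests but does not confirm. The variable names are made up for the example.

```python
# Illustration only: mimics a "parse as float, then stringify" round trip
# with built-in Python floats (assumed to behave like the suspected
# intermediate float64 step; this is not PyArrow's actual code path).
raw = "1225717802.1679841607"  # the user_id value from bug.csv

as_float = float(raw)          # a 64-bit float cannot hold all 20 digits
round_tripped = str(as_float)  # shortest string that maps back to as_float

print(raw)
print(round_tripped)

# The trailing digits are gone, so the strings no longer match.
lost_digits = round_tripped != raw
print(lost_digits)
```

A 64-bit float carries at most 17 significant decimal digits, so a 20-digit identifier cannot survive this path unchanged, whichever library performs the conversion.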
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
