Meehai opened a new issue, #35661:
URL: https://github.com/apache/arrow/issues/35661
### Describe the bug, including details regarding any error messages, version, and platform.
Hello, I've posted this issue on the pandas board as well, and they've asked
me to put it here too:
```
from pathlib import Path

import pandas as pd

# Write the two-line CSV that reproduces the problem.
Path("bug.csv").write_text("user_id,value\n1225717802.1679841607,33\n")

a = pd.read_csv("bug.csv", dtype={"user_id": str})
b = pd.read_csv("bug.csv", dtype={"user_id": str}, engine="pyarrow")
print(a.user_id.iloc[0])  # 1225717802.1679841607
print(b.user_id.iloc[0])  # 1225717802.1679842
assert a.user_id.dtype == b.user_id.dtype  # <- both are strings
assert a.user_id.iloc[0] == b.user_id.iloc[0]  # <- this fails
```
It seems that under the hood `pyarrow.csv.read_csv` does not handle columns
explicitly requested as strings the way I expected: some automatic type
conversion appears to happen before the explicit string conversion takes
place. In this case the value seems to be interpreted as a float first, lose
digits because of limited float precision, and only then be converted back to
a string.
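The suspected float round trip can be illustrated with the standard library alone. This is only a sketch of the hypothesis above, not a trace of what pandas or pyarrow actually do internally; the variable names are made up for illustration.

```python
from decimal import Decimal

raw = "1225717802.1679841607"  # the user_id text from bug.csv

# Hypothesised lossy path: text -> 64-bit float -> text.
as_float = float(raw)
round_tripped = repr(as_float)

# Lossless alternative: parse the text as an arbitrary-precision decimal.
as_decimal = Decimal(raw)

print(round_tripped)         # shorter than raw: a float64 cannot hold all 20 digits
print(str(as_decimal))       # identical to raw
print(round_tripped == raw)  # False
```

If the column were kept as text from the start, no digits would be lost, which is the behaviour the default pandas engine shows in the reproduction.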
### Component(s)
Python