davlee1972 commented on issue #46223:
URL: https://github.com/apache/arrow/issues/46223#issuecomment-2828452101
That's because the first batch is all 1s, so the initial schema the reader infers for the RecordBatchReader is Int64 for that column.
It then fails to read the batch containing "A" as an Int64, and that failure happens before cast() is ever applied.
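For reference, here's a minimal sketch of that failure mode. It assumes the /tmp/mixed.csv file built in the snippet further down, and the exact error text is approximate:
```python
import pyarrow
import pyarrow.csv

# Minimal sketch, assuming /tmp/mixed.csv from the snippet below.
# With no column_types, the streaming reader infers Int64 from the
# first block and then raises ArrowInvalid on the "A" row -- before
# any cast() could run.
try:
    with pyarrow.csv.open_csv("/tmp/mixed.csv") as r:
        for batch in r:
            pass
except pyarrow.ArrowInvalid as e:
    print(e)  # e.g. "CSV conversion error to int64: invalid value 'A'"
```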
The following works: explicitly map column names to column types via ConvertOptions(column_types=...). Passing this argument disables type inference on the defined columns.
```python
import pyarrow
import pyarrow.csv

# Build a CSV whose first million rows are integers, followed by one string row.
with open("/tmp/mixed.csv", "w") as f:
    f.write("mixed_column\n")
    f.write("1\n" * 1000000)
    f.write("A\n")

def cast_test():
    # Pin the column type up front so no inference happens on it.
    c_options = pyarrow.csv.ConvertOptions(column_types={"mixed_column": "string"})
    print(c_options)
    with pyarrow.csv.open_csv("/tmp/mixed.csv", convert_options=c_options) as r:
        for batch in r:
            print(batch)

cast_test()
```
Personally, I use pyarrow.dataset() for everything instead of the csv, parquet, etc. readers.
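For illustration, a minimal sketch of that approach, assuming the same /tmp/mixed.csv file as above; the dataset CSV format accepts the same ConvertOptions, so the column type can be pinned there as well:
```python
import pyarrow.csv
import pyarrow.dataset as ds

# Minimal sketch, assuming /tmp/mixed.csv from above. Pass the
# ConvertOptions through the dataset's CSV file format so the
# column is read as string instead of being inferred.
csv_format = ds.CsvFileFormat(
    convert_options=pyarrow.csv.ConvertOptions(
        column_types={"mixed_column": "string"}
    )
)
dataset = ds.dataset("/tmp/mixed.csv", format=csv_format)
print(dataset.to_table())
```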