davlee1972 commented on issue #46223:
URL: https://github.com/apache/arrow/issues/46223#issuecomment-2828452101
That's because the first batch is all 1s, so the initial schema the reader infers for the RecordBatchReader is Int64 for that column.
It then fails to read the batch containing "A" as an Int64, and that failure happens before cast() is ever applied.
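For reference, here's a minimal sketch of that failure mode. It assumes the /tmp/mixed.csv file built in the snippet further down, and the exact error text is approximate:
```python
import pyarrow
import pyarrow.csv

# Minimal sketch, assuming /tmp/mixed.csv from the snippet below.
# With no column_types, the streaming reader infers Int64 from the
# first block and then raises ArrowInvalid on the "A" row -- before
# any cast() could run.
try:
    with pyarrow.csv.open_csv("/tmp/mixed.csv") as r:
        for batch in r:
            pass
except pyarrow.ArrowInvalid as e:
    print(e)  # e.g. "CSV conversion error to int64: invalid value 'A'"
```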
The following works: explicitly map column names to column types via ConvertOptions(column_types=...). Passing this argument disables type inference on the defined columns.
```python
import pyarrow
import pyarrow.csv

# Build a CSV whose first million rows are integers, followed by one string row.
with open("/tmp/mixed.csv", "w") as f:
    f.write("mixed_column\n")
    f.write("1\n" * 1000000)
    f.write("A\n")

def cast_test():
    # Pin the column type up front so no inference happens on it.
    c_options = pyarrow.csv.ConvertOptions(column_types={"mixed_column": "string"})
    print(c_options)
    with pyarrow.csv.open_csv("/tmp/mixed.csv", convert_options=c_options) as r:
        for batch in r:
            print(batch)

cast_test()
```
Personally, I use pyarrow.dataset() for everything instead of the csv, parquet, etc. readers.
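For illustration, a minimal sketch of that approach, assuming the same /tmp/mixed.csv file as above; the dataset CSV format accepts the same ConvertOptions, so the column type can be pinned there as well:
```python
import pyarrow.csv
import pyarrow.dataset as ds

# Minimal sketch, assuming /tmp/mixed.csv from above. Pass the
# ConvertOptions through the dataset's CSV file format so the
# column is read as string instead of being inferred.
csv_format = ds.CsvFileFormat(
    convert_options=pyarrow.csv.ConvertOptions(
        column_types={"mixed_column": "string"}
    )
)
dataset = ds.dataset("/tmp/mixed.csv", format=csv_format)
print(dataset.to_table())
```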