westonpace commented on issue #34010: URL: https://github.com/apache/arrow/issues/34010#issuecomment-1414553376
It is not obvious, but it is possible. The `block_size` in [`pyarrow.csv.ReadOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions) is what actually determines our inference depth. To specify custom read options you will need to create a [`pyarrow.dataset.CsvFileFormat`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.CsvFileFormat.html#pyarrow.dataset.CsvFileFormat). Regrettably, the inference depth is always somewhat tied to our I/O performance. However, I suspect you can bump up the default quite a bit before you start to notice significant effects. A complete example:

```
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

MiB = 1024 * 1024
read_options = csv.ReadOptions(block_size=16 * MiB)  # Note, the default is 1MiB
csv_format = ds.CsvFileFormat(read_options=read_options)

my_dataset = ds.dataset('/tmp/my_dataset', format=csv_format)
print(my_dataset.to_table())
```
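If only a handful of columns are being mis-inferred, another option is to pin their types explicitly through `ConvertOptions.column_types` instead of (or in addition to) enlarging the block. A minimal sketch, assuming a hypothetical column named `my_string_col` that inference keeps guessing wrong:

```
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

# Pin the type of a problematic column; all other columns are still inferred
# from the data. 'my_string_col' is a made-up name, substitute your own.
convert_options = csv.ConvertOptions(column_types={'my_string_col': pa.string()})
csv_format = ds.CsvFileFormat(convert_options=convert_options)

my_dataset = ds.dataset('/tmp/my_dataset', format=csv_format)
print(my_dataset.to_table())
```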
