earlev4 opened a new issue, #34010:
URL: https://github.com/apache/arrow/issues/34010

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello! First off, a huge thank you to the contributors of the Arrow project! 
I am grateful for the project.
   
   Please forgive me if this question has already been asked and answered. I 
did my due diligence before asking: I read the PyArrow documentation (Tabular 
Datasets, Python API, Python Cookbook, etc.) and searched the GitHub issues, 
but perhaps I missed it. I am using the Python API, specifically the 
`dataset()` function, to read a collection of CSV files. From the 
documentation, I understand that when no schema is passed to the `schema` 
parameter of `dataset()`, the schema is inferred. Unfortunately, the schema 
inference does not appear to read far enough into the CSV file to detect the 
appropriate data types, because some columns have missing values in the first 
several thousand rows. I understand that I can define a schema and pass it to 
the `schema` parameter; however, with 50, 100, or more columns, defining a 
schema by hand is tedious.
   
   Here is an example:  
   ```python
   import pyarrow.dataset as ds

   my_ds = ds.dataset('./path/to/csv/files', format='csv')
   my_ds.head(5)
   ```
   
   Error example:
   ```text
   ArrowInvalid: In CSV column #100: Row #5000: CSV conversion error to null: invalid value 'ABC'.
   ```
   
   I believe the error occurs because the rows used to infer the schema 
contain only missing values in column 100, so the column is inferred as type 
null; when row 5000 is reached with the string value `ABC`, the value is 
rejected as invalid.
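
To illustrate what I think is happening, here is a minimal pure-Python sketch of type inference over a limited number of rows. It is not PyArrow's actual implementation; `infer_type`, `infer_schema`, and `max_rows` are hypothetical names I made up for this example, and the data is a tiny stand-in for my real files.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Guess a type name from sampled strings (hypothetical helper)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        # Nothing but missing values in the sample: fall back to "null".
        return "null"
    for name, cast in (("int64", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(text, max_rows):
    """Infer column types from only the first `max_rows` data rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    sample = list(islice(reader, max_rows))
    return {
        name: infer_type([row[i] for row in sample])
        for i, name in enumerate(header)
    }


# Four rows with a missing `code`, then one row holding the string 'ABC'.
lines = ["id,code"] + [f"{i}," for i in range(1, 5)] + ["5,ABC"]
text = "\n".join(lines)

short = infer_schema(text, max_rows=4)  # sample stops before 'ABC'
full = infer_schema(text, max_rows=5)   # sample includes 'ABC'
```

With the short sample, `code` is inferred as `null`, so a later `ABC` cannot be converted; with the longer sample, it is inferred as `string`. That is the behavior I would like to be able to control.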
   
   I know that libraries such as `polars` provide an `infer_schema_length` 
parameter for the 
[`read_csv`](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html) 
function, which sets the maximum number of rows read to infer the schema.
   
   Is there something similar in PyArrow that sets the maximum number of rows 
used to infer the schema in the `dataset()` function?
   
   Thanks in advance! I sincerely appreciate any guidance.
   
   
   ### Component(s)
   
   Python

