[GitHub] [arrow] earlev4 opened a new issue, #34010: PyArrow Dataset Schema Infer Depth - Maximum Number of Rows

via GitHub Thu, 02 Feb 2023 10:50:04 -0800


earlev4 opened a new issue, #34010:
URL: https://github.com/apache/arrow/issues/34010

### Describe the usage question you have. Please include as many useful details as possible.

Hello! First off, a huge thank you to the contributors of the Arrow project!
I am grateful for the project.

Please forgive me if this question has already been asked and answered. I
tried to do my due diligence before asking: I read the PyArrow documentation
(Tabular Datasets, Python API, Python Cookbook, etc.) and searched the GitHub
issues, but perhaps I missed it. I am using the Python API, specifically the
`dataset()` function, to read a collection of CSV files. From the
documentation, I understand that when no schema is passed to the `schema`
parameter of `dataset()`, the schema is inferred. Unfortunately, the inference
does not appear to read far enough into the CSV files to detect the correct
data types, because some columns contain only missing values in the first
several thousand rows. I understand that I can define a schema and pass it to
the `schema` parameter; however, with 50, 100, or more columns, writing out a
full schema is a lot of work.

Here is an example:
```python
import pyarrow.dataset as ds
# No `schema` argument is passed, so the schema is inferred.
my_ds = ds.dataset('./path/to/csv/files', format='csv')
my_ds.head(5)
```

Error example:
```text
ArrowInvalid: In CSV column #100: Row #5000: CSV conversion error to null: invalid value 'ABC'.
```

I believe the error occurs because the rows used to infer the schema contain
only null values in column 100, so that column is inferred as type null. When
row 5000 is reached with the string value `ABC`, the value cannot be converted
to null and is rejected as invalid.
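
The suspected mechanism can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not PyArrow's implementation: `infer_column_types`, `convert_rows`, the two-column layout, and the sample sizes are all hypothetical and chosen for illustration.

```python
import csv
import io


def infer_column_types(rows, sample_size):
    """Guess each column's type from the first `sample_size` rows only (toy model)."""
    types = {}
    for row in rows[:sample_size]:
        for name, value in row.items():
            if value == "":
                # A missing value tells us nothing; the column stays "null" for now.
                types.setdefault(name, "null")
            elif value.lstrip("-").isdigit():
                # Only upgrade from "null"; never downgrade an existing "string".
                if types.get(name, "null") == "null":
                    types[name] = "int64"
            else:
                types[name] = "string"
    return types


def convert_rows(rows, types):
    """Check every row against the inferred types; raise on the first mismatch."""
    for row_number, row in enumerate(rows, start=1):
        for name, value in row.items():
            if value == "":
                continue  # missing values are valid for every type
            kind = types[name]
            if kind == "null" or (kind == "int64" and not value.lstrip("-").isdigit()):
                raise ValueError(
                    f"Row #{row_number}: conversion error to {kind}: invalid value {value!r}"
                )


# Hypothetical data: `code` is empty for 4999 rows, then holds 'ABC' in row 5000.
text = "id,code\n" + "".join(f"{i},\n" for i in range(1, 5000)) + "5000,ABC\n"
rows = list(csv.DictReader(io.StringIO(text)))

shallow = infer_column_types(rows, sample_size=1000)  # never sees 'ABC'
deep = infer_column_types(rows, sample_size=5000)     # sees 'ABC'

try:
    convert_rows(rows, shallow)
except ValueError as exc:
    print("shallow inference failed:", exc)

convert_rows(rows, deep)  # no error: `code` was inferred as string
```

In this toy model, the only difference between the failing and the working run is how many rows the inference step looks at, which is the knob the question below asks about.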

I know that libraries such as `polars` provide an `infer_schema_length`
parameter for the
[`read_csv`](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html)
function, which sets the maximum number of rows to read when inferring the
schema.

Is there something similar in PyArrow that sets the maximum number of rows
used to infer the schema in the `dataset()` function?

Thanks in advance! I sincerely appreciate any guidance.

### Component(s)

Python

