[
https://issues.apache.org/jira/browse/ARROW-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326510#comment-17326510
]
Oleksandr Shevchenko commented on ARROW-12482:
----------------------------------------------
Thanks for the quick reply, [~apitrou]!
Could you also comment on the conversion error? I'm not sure why the empty value
can't be converted to null for the date32 type. I tried changing
[null_values|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions.null_values]
and several other conversion options but didn't find anything that helps with
this particular case.
> [Doc][Python] Mention CSVStreamingReader pitfalls with type inference
> ---------------------------------------------------------------------
>
> Key: ARROW-12482
> URL: https://issues.apache.org/jira/browse/ARROW-12482
> Project: Apache Arrow
> Issue Type: Bug
> Components: Documentation, Python
> Affects Versions: 3.0.0
> Reporter: Oleksandr Shevchenko
> Priority: Major
> Labels: CSVParser, CSVReader
>
> It looks like Arrow infers the type from the first batch and applies it to all
> subsequent batches, but the first batch might not contain enough information
> to infer the type correctly for the whole file. In our particular case, Arrow
> infers a field in the schema as date32 from the first batch, but the next
> batch has an empty value for that field that can’t be converted to date32.
> When I increase the batch size so that such a value falls in the first batch,
> Arrow sets the string type for that field (not sure why not nullable date32),
> since the value can’t be converted to date32, and the whole file is read
> successfully.
> This problem can be easily reproduced by using the following code and
> attached dataset:
> {code:java}
> import pyarrow as pa
> import pyarrow._csv as pa_csv
> import pyarrow._fs as pa_fs
>
> read_options: pa_csv.ReadOptions = pa_csv.ReadOptions(block_size=5_000_000)
> parse_options: pa_csv.ParseOptions = pa_csv.ParseOptions(newlines_in_values=True)
> convert_options: pa_csv.ConvertOptions = pa_csv.ConvertOptions(timestamp_parsers=[''])
>
> with pa_fs.LocalFileSystem().open_input_file("dataset.csv") as file:
>     reader = pa_csv.open_csv(
>         file, read_options=read_options, parse_options=parse_options,
>         convert_options=convert_options
>     )
>     for batch in reader:
>         table_batch = pa.Table.from_batches([batch])
>         table_batch
> {code}
> Error message:
> {code:java}
> for batch in reader:
>   File "pyarrow/ipc.pxi", line 497, in __iter__
>   File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: In CSV column #23: CSV conversion error to date32[day]: invalid value ''
> {code}
>
> With a block_size of `10_000_000`, the file can be read successfully, since
> the problematic value is then in the first batch.
> An error occurs when I try to attach the dataset, so you can download it from
> Google Drive
> [here|https://drive.google.com/file/d/1Vt1yN02dyVumsou_kFs7GTnKT46eE6ja/view?usp=sharing]