Oleksandr Shevchenko created ARROW-12482:
--------------------------------------------
Summary: [Python] Inconsistent schema for CSVStreamingReader batches
Key: ARROW-12482
URL: https://issues.apache.org/jira/browse/ARROW-12482
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0
Reporter: Oleksandr Shevchenko
It looks like Arrow infers the column types from the first batch and applies
them to all subsequent batches. However, the first batch may not contain enough
information to infer the types correctly for the whole file. In our particular
case, Arrow infers a field in the schema as date32 from the first batch, but the
next batch has an empty value for that field, which can’t be converted to date32.
When I increase the block size so that such a value falls in the first batch,
Arrow sets the field type to string (not sure why not nullable date32), since
the value can’t be converted to date32, and the whole file is read successfully.
This problem can be easily reproduced with the following code and the dataset
linked below:
{code:java}
import pyarrow as pa
from pyarrow import csv as pa_csv
from pyarrow import fs as pa_fs

read_options: pa_csv.ReadOptions = pa_csv.ReadOptions(block_size=5_000_000)
parse_options: pa_csv.ParseOptions = pa_csv.ParseOptions(newlines_in_values=True)
convert_options: pa_csv.ConvertOptions = pa_csv.ConvertOptions(timestamp_parsers=[''])

with pa_fs.LocalFileSystem().open_input_file("dataset.csv") as file:
    reader = pa_csv.open_csv(
        file, read_options=read_options, parse_options=parse_options,
        convert_options=convert_options
    )
    # Fails on a later batch whose values do not fit the inferred schema
    for batch in reader:
        table_batch = pa.Table.from_batches([batch])
        table_batch
{code}
With a block_size of `10_000_000`, the file is read successfully because the
problematic value falls in the first batch.
An error occurs when I try to attach the dataset, so you can download it from
Google Drive
[here|https://drive.google.com/file/d/1Vt1yN02dyVumsou_kFs7GTnKT46eE6ja/view?usp=sharing]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)