[ 
https://issues.apache.org/jira/browse/ARROW-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksandr Shevchenko updated ARROW-12482:
-----------------------------------------
    Description: 
It looks like Arrow infers the column types from the first batch and applies 
them to all subsequent batches. But the first batch might not contain enough 
information to infer the types correctly for the whole file. In our particular 
case, Arrow infers one field in the schema as date32 from the first batch, but 
the next batch has an empty value for that field that can’t be converted to 
date32.

When I increase the block size so that such a value falls into the first batch, 
Arrow sets the string type for this field (not sure why not nullable date32), 
since the value can’t be converted to date32, and the whole file is read 
successfully.
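
The behavior described above can be modeled in a few lines of plain Python. 
This is only a simplified sketch, not Arrow's implementation: `infer_type`, 
`convert`, and `read_in_batches` are hypothetical helpers, and the sample 
column is made up for illustration.

```python
from datetime import date


def infer_type(values):
    """Return 'date32' if every value parses as an ISO date, else 'string'."""
    try:
        for value in values:
            date.fromisoformat(value)
        return "date32"
    except ValueError:
        return "string"


def convert(values, type_name):
    """Convert one batch of raw strings using an already fixed type."""
    if type_name == "date32":
        # Raises ValueError for values such as '' that are not dates.
        return [date.fromisoformat(value) for value in values]
    return list(values)


def read_in_batches(column, batch_size):
    """Infer the type from the first batch only, then reuse it for the rest."""
    batches = [
        column[i:i + batch_size] for i in range(0, len(column), batch_size)
    ]
    type_name = infer_type(batches[0])
    return type_name, [convert(batch, type_name) for batch in batches]


# Made-up sample column: the third value is empty.
column = ["2021-01-01", "2021-01-02", "", "2021-01-04"]

# Large batch: the empty value is seen during inference, so the column
# falls back to string and the whole column is read.
print(read_in_batches(column, batch_size=4)[0])

# Small batch: date32 is inferred from the first two rows, and the empty
# value in the second batch can no longer be converted.
try:
    read_in_batches(column, batch_size=2)
except ValueError as exc:
    print("conversion failed:", exc)
```

In this model the outcome depends only on whether the empty value lands in the 
batch used for inference, which matches what we observe when changing the block 
size.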

This problem can be easily reproduced with the following code and the dataset 
linked below:
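{code:python}
import pyarrow as pa
import pyarrow.csv as pa_csv
import pyarrow.fs as pa_fs

# With this block size, the first batch does not contain the empty value.
read_options: pa_csv.ReadOptions = pa_csv.ReadOptions(block_size=5_000_000)
parse_options: pa_csv.ParseOptions = pa_csv.ParseOptions(newlines_in_values=True)
convert_options: pa_csv.ConvertOptions = pa_csv.ConvertOptions(timestamp_parsers=[''])

with pa_fs.LocalFileSystem().open_input_file("dataset.csv") as file:
    reader = pa_csv.open_csv(
        file,
        read_options=read_options,
        parse_options=parse_options,
        convert_options=convert_options,
    )
    # Iterating raises ArrowInvalid once a batch contains a value that
    # does not fit the type inferred from the first batch.
    for batch in reader:
        table_batch = pa.Table.from_batches([batch])
        table_batch
{code}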
Error message:
{code:java}
    for batch in reader:
  File "pyarrow/ipc.pxi", line 497, in __iter__
  File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: In CSV column #23: CSV conversion error to date32[day]: invalid value ''
{code}
 
When we use a block_size of `10_000_000`, the file is read successfully, since 
the problematic value then falls into the first batch.

An error occurs when I try to attach the dataset, so you can download it from 
Google Drive 
[here|https://drive.google.com/file/d/1Vt1yN02dyVumsou_kFs7GTnKT46eE6ja/view?usp=sharing].


> [Python] Inconsistent schema for CSVStreamingReader batches
> -----------------------------------------------------------
>
>                 Key: ARROW-12482
>                 URL: https://issues.apache.org/jira/browse/ARROW-12482
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Oleksandr Shevchenko
>            Priority: Major
>              Labels: CSVParser, CSVReader



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
