[ 
https://issues.apache.org/jira/browse/ARROW-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326559#comment-17326559
 ] 

Oleksandr Shevchenko commented on ARROW-12482:
----------------------------------------------

Thanks for the clarification!

> [Doc][Python] Mention CSVStreamingReader pitfalls with type inference
> ---------------------------------------------------------------------
>
>                 Key: ARROW-12482
>                 URL: https://issues.apache.org/jira/browse/ARROW-12482
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Documentation, Python
>    Affects Versions: 3.0.0
>            Reporter: Oleksandr Shevchenko
>            Priority: Major
>              Labels: CSVParser, CSVReader
>
> It looks like Arrow infers the type from the first batch and applies it to 
> all subsequent batches. But the first batch might not contain enough 
> information to infer the type correctly for the whole file. In our 
> particular case, Arrow infers some field in the schema as date32 from the 
> first batch, but the next batch has an empty field value that can’t be 
> converted to date32.
> When I increase the batch size so that such a value falls in the first 
> batch, Arrow sets the string type for that field (not sure why not nullable 
> date32), since the value can’t be converted to date32, and the whole file 
> is read successfully.
> This problem can be easily reproduced by using the following code and 
> attached dataset:
> {code:java}
> import pyarrow as pa
> import pyarrow._csv as pa_csv
> import pyarrow._fs as pa_fs
>
> read_options: pa_csv.ReadOptions = pa_csv.ReadOptions(block_size=5_000_000)
> parse_options: pa_csv.ParseOptions = pa_csv.ParseOptions(newlines_in_values=True)
> convert_options: pa_csv.ConvertOptions = pa_csv.ConvertOptions(timestamp_parsers=[''])
> with pa_fs.LocalFileSystem().open_input_file("dataset.csv") as file:
>     reader = pa_csv.open_csv(
>         file, read_options=read_options, parse_options=parse_options,
>         convert_options=convert_options
>     )
>     for batch in reader:
>         table_batch = pa.Table.from_batches([batch])
>         table_batch
> {code}
> Error message:
> {code:java}
>     for batch in reader:
>   File "pyarrow/ipc.pxi", line 497, in __iter__
>   File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: In CSV column #23: CSV conversion error to date32[day]: invalid value ''
> {code}
>  
>  When we use a block_size of `10_000_000`, the file can be read 
> successfully, since the problematic value is then in the first batch.
> An error occurred when I tried to attach the dataset, so you can download 
> it from Google Drive 
> [here|https://drive.google.com/file/d/1Vt1yN02dyVumsou_kFs7GTnKT46eE6ja/view?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
