[ https://issues.apache.org/jira/browse/ARROW-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-9297:
-----------------------------------------
    Fix Version/s: 1.0.0

> [C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)
> -------------------------------------------------------------------------
>
>                 Key: ARROW-9297
>                 URL: https://issues.apache.org/jira/browse/ARROW-9297
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Related to ARROW-3762 (the Parquet read issue, which has been solved) and 
> discovered in ARROW-9139.
> When creating a Parquet file with a large binary column (larger than the 
> 2 GiB BinaryArray offset capacity):
> {code}
> # code from the test_parquet.py::test_binary_array_overflow_to_chunked test
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # 2048 values of 1 MiB each, plus one extra byte -> just over 2 GiB in total
> values = [b'x'] + [
>     b'x' * (1 << 20)
> ] * 2 * (1 << 10)
>
> table = pa.table({'byte_col': values})
> pq.write_table(table, "test_large_binary.parquet")
> {code}
> then reading this back with the Parquet API works (fixed by ARROW-3762):
> {code}
> In [3]: pq.read_table("test_large_binary.parquet")
> Out[3]: 
> pyarrow.Table
> byte_col: binary
> {code}
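> The read works because ARROW-3762 made the Parquet reader overflow into a
> ChunkedArray instead of failing. A quick check (sketch; the exact chunk
> count depends on the row group layout, so it is not hard-coded here):
> {code}
> In [4]: col = pq.read_table("test_large_binary.parquet").column("byte_col")
>
> In [5]: type(col)
> Out[5]: pyarrow.lib.ChunkedArray
>
> In [6]: col.num_chunks > 1   # > 2 GiB cannot fit in a single BinaryArray
> Out[6]: True
> {code}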
> but with the Datasets API this still fails:
> {code}
> In [1]: import pyarrow.dataset as ds
>
> In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")
>
> In [4]: dataset.to_table()
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> <ipython-input-4-6fb0d79c4511> in <module>
> ----> 1 dataset.to_table()
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: This class cannot yet iterate chunked arrays
> {code}
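> A possible workaround until the scanner can iterate chunked arrays
> (untested sketch): write the column as large_binary, whose 64-bit offsets
> keep the data in a single array so no chunking is needed on read:
> {code}
> # sketch, not verified: large_binary uses 64-bit offsets, so the column
> # does not overflow into a ChunkedArray when read back
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> values = [b'x'] + [b'x' * (1 << 20)] * 2 * (1 << 10)
> table = pa.table({'byte_col': pa.array(values, type=pa.large_binary())})
> pq.write_table(table, "test_large_binary2.parquet")
>
> # the Arrow schema is stored in the file metadata, so the column reads
> # back as large_binary and the scanner gets plain (non-chunked) arrays
> ds.dataset("test_large_binary2.parquet", format="parquet").to_table()
> {code}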



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
