[ https://issues.apache.org/jira/browse/ARROW-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-9297:
-----------------------------------------
    Fix Version/s: 1.0.0

> [C++][Dataset] Dataset scanner cannot handle large binary column (> 2 GB)
> -------------------------------------------------------------------------
>
>                 Key: ARROW-9297
>                 URL: https://issues.apache.org/jira/browse/ARROW-9297
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Related to ARROW-3762 (the parquet issue which has been solved), and
> discovered in ARROW-9139.
> When creating a Parquet file with a large binary column (larger than
> BinaryArray capacity):
> {code}
> # code from the test_parquet.py::test_binary_array_overflow_to_chunked test
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> values = [b'x'] + [
>     b'x' * (1 << 20)
> ] * 2 * (1 << 10)
> table = pa.table({'byte_col': values})
> pq.write_table(table, "test_large_binary.parquet")
> {code}
> then reading this with the parquet API works (fixed by ARROW-3762):
> {code}
> In [3]: pq.read_table("test_large_binary.parquet")
> Out[3]:
> pyarrow.Table
> byte_col: binary
> {code}
> but with the Datasets API this still fails:
> {code}
> In [1]: import pyarrow.dataset as ds
> In [2]: dataset = ds.dataset("test_large_binary.parquet", format="parquet")
> In [4]: dataset.to_table()
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> <ipython-input-4-6fb0d79c4511> in <module>
> ----> 1 dataset.to_table()
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: This class cannot yet iterate chunked arrays
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
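
A rough sketch of why the column in the report is "larger than BinaryArray capacity", assuming BinaryArray addresses its value data with signed 32-bit offsets (so at most 2**31 - 1 bytes per array). The helper name total_value_bytes and the constant INT32_MAX are illustrative only and not part of pyarrow; the sizes mirror the `values` list from the issue.

```python
# Assumption: a single BinaryArray uses signed 32-bit offsets, so the total
# size of its value data cannot exceed 2**31 - 1 bytes.
INT32_MAX = 2**31 - 1


def total_value_bytes(first_len, big_len, repeats):
    """Total bytes of one small value plus `repeats` values of `big_len` bytes."""
    return first_len + big_len * repeats


# Same shape as `values` in the issue: one 1-byte value followed by
# 2 * (1 << 10) values of (1 << 20) bytes each.
total = total_value_bytes(1, 1 << 20, 2 * (1 << 10))

# Number of bytes by which the column exceeds the assumed single-array limit.
overflow = total - INT32_MAX

print("total bytes:", total)
print("int32 limit:", INT32_MAX)
print("over by:    ", overflow)
```

Under that assumption the data cannot fit in one BinaryArray, which is consistent with the reader having to return the column in more than one chunk and with the "cannot yet iterate chunked arrays" error in the traceback.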