[ 
https://issues.apache.org/jira/browse/ARROW-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323792#comment-17323792
 ] 

Uwe Korn commented on ARROW-12420:
----------------------------------

cc [~bkietz] who wrote the PR that broke it ;)

> [C++/Dataset] Reading null columns as dictionary no longer possible
> -------------------------------------------------------------------
>
>                 Key: ARROW-12420
>                 URL: https://issues.apache.org/jira/browse/ARROW-12420
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 4.0.0
>            Reporter: Uwe Korn
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Reading a dataset with a dictionary column where some of the files don't
> contain any data for that column (and are thus typed as null) broke with
> https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release,
> so I would consider this a regression.
> This can be reproduced using the following Python snippet:
> {code}
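> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.fs  # makes pa.fs available explicitly
> import pyarrow.parquet as pq
>
> # A column holding only None values is inferred as type null.
> table = pa.table({"a": [None, None]})
> pq.write_table(table, "test.parquet")
>
> # Ask the dataset to read that column as a dictionary instead.
> schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
> fsds = ds.FileSystemDataset.from_paths(
>     paths=["test.parquet"],
>     schema=schema,
>     format=ds.ParquetFileFormat(),
>     filesystem=pa.fs.LocalFileSystem(),
> )
> # Raises ArrowNotImplementedError on master; worked with 3.0.
> fsds.to_table()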
> {code}
> The exception on master is currently:
> {code}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> <ipython-input-14-5f0bc602f16b> in <module>
>       6     filesystem=pa.fs.LocalFileSystem(),
>       7 )
> ----> 8 fsds.to_table()
> ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
>     456         table : Table instance
>     457         """
> --> 458         return self._scanner(**kwargs).to_table()
>     459 
>     460     def head(self, int num_rows, **kwargs):
> ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
>    2887             result = self.scanner.ToTable()
>    2888 
> -> 2889         return pyarrow_wrap_table(GetResultValue(result))
>    2890 
>    2891     def take(self, object indices):
> ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
>     139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
>     140         nogil except -1:
> --> 141     return check_status(status)
>     142 
>     143 
> ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
>     116             raise ArrowKeyError(message)
>     117         elif status.IsNotImplemented():
> --> 118             raise ArrowNotImplementedError(message)
>     119         elif status.IsTypeError():
>     120             raise ArrowTypeError(message)
> ArrowNotImplementedError: Unsupported cast from null to dictionary<values=string, indices=int32, ordered=0> (no available cast function for target type)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
