[ 
https://issues.apache.org/jira/browse/ARROW-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271190#comment-17271190
 ] 

Joris Van den Bossche commented on ARROW-11353:
-----------------------------------------------

Note that, in general, the Python Parquet bindings do not expose any way to 
override the type used when reading specific columns (I don't know to what 
extent this exists in the C++ layer). So this is not an issue specific to the 
large types. For the large types, though, it might be that this can be done 
more efficiently in the C++ layer than by "just" casting afterwards?

For the Dataset API (which the example in the issue description uses): it is 
a known limitation right now that the schema has to match exactly, and almost 
no schema evolution is allowed (only from null -> any type). See ARROW-11003 
for the umbrella issue on this (e.g. we also want to allow reading int32 
columns as int64).

> [C++][Python][Parquet] We should allow for overriding to large types by 
> providing a schema
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11353
>                 URL: https://issues.apache.org/jira/browse/ARROW-11353
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Micah Kornfield
>            Priority: Major
>
> {{The following shouldn't throw}}
> {{>>> import pyarrow as pa}}
> {{>>> import pyarrow.parquet as pq}}
> {{>>> import pyarrow.dataset as ds}}
> {{>>> pa.__version__}}
> {{'2.0.0'}}
> {{>>> schema = pa.schema([pa.field("utf8", pa.utf8())])}}
> {{>>> table = pa.Table.from_pydict({"utf8": ["foo", "bar"]}, schema)}}
> {{>>> pq.write_table(table, "/tmp/example.parquet")}}
> {{>>> large_schema = pa.schema([pa.field("utf8", pa.large_utf8())])}}
> {{>>> ds.dataset("/tmp/example.parquet", schema=large_schema,}}
> {{format="parquet").to_table()}}
> {{Traceback (most recent call last):}}
> {{  File "<stdin>", line 1, in <module>}}
> {{  File "pyarrow/_dataset.pyx", line 405, in}}
> {{pyarrow._dataset.Dataset.to_table}}
> {{  File "pyarrow/_dataset.pyx", line 2262, in}}
> {{pyarrow._dataset.Scanner.to_table}}
> {{  File "pyarrow/error.pxi", line 122, in}}
> {{pyarrow.lib.pyarrow_internal_check_status}}
> {{  File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status}}
> {{pyarrow.lib.ArrowTypeError: fields had matching names but differing types.}}
> {{From: utf8: string To: utf8: large_string}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
