Re: [C++] read Parquet columns into 64-bit offset types

Steve Kim Sun, 17 Jan 2021 21:18:36 -0800

> This should be possible already, at least on git master but perhaps also
> in 2.0.0.  Which problem are you encountering?


With pyarrow 2.0.0, I encountered the following error when reading a utf8
Parquet column through a dataset whose schema requests large_utf8:

```
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> pa.__version__
'2.0.0'
>>> schema = pa.schema([pa.field("utf8", pa.utf8())])
>>> table = pa.Table.from_pydict({"utf8": ["foo", "bar"]}, schema)
>>> pq.write_table(table, "/tmp/example.parquet")
>>> large_schema = pa.schema([pa.field("utf8", pa.large_utf8())])
>>> ds.dataset("/tmp/example.parquet", schema=large_schema, format="parquet").to_table()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 405, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2262, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: fields had matching names but differing types. From: utf8: string To: utf8: large_string
```

I reproduced this behavior with pyarrow built from source on the master
branch (5f1be953).
