[
https://issues.apache.org/jira/browse/ARROW-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otávio Vasques updated ARROW-7727:
----------------------------------
Description:
I was trying to read a subset of my parquet files using the ParquetDataset
object with a predefined schema. When it tries to validate the schema,
`to_arrow_schema` is called and the schema object does not support this. I
don't know what is happening; this is a sample.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
schema = pa.schema([
    pa.field("field1", pa.string()),
    pa.field("field2", pa.string()),
    pa.field("field3", pa.string()),
])
...
pq_dataset = pq.ParquetDataset(file_groups[0], schema=schema)
AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'
{code}
If we check the type of the schema defined above, we get:
{code:python}
type(schema)
pyarrow.lib.Schema
{code}
But the required type according to the docs is `pyarrow.parquet.Schema`. I
don't know how to produce an object of this type, since we are forbidden from
using the Schema constructor directly.
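For what it's worth, the only way I have found to produce a schema of that
flavor is to read it back from an existing parquet file (the file name below
is just a placeholder); that object does support `to_arrow_schema`:
{code:python}
import pyarrow.parquet as pq

# "example.parquet" is a placeholder for any existing parquet file.
pq_schema = pq.ParquetFile("example.parquet").schema
print(type(pq_schema))  # a parquet-flavored schema, not pyarrow.lib.Schema
arrow_schema = pq_schema.to_arrow_schema()  # this call works on that type
{code}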
If we check the implementation on GitHub, we land directly on this line
[here|https://github.com/apache/arrow/blob/apache-arrow-0.15.1/python/pyarrow/parquet.py#L1097]:
{code:python}
dataset_schema = self.schema.to_arrow_schema()
{code}
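As a stopgap, skipping validation seems to avoid the crash; this is an
untested sketch that reuses `file_groups` and `schema` from the snippet above
and checks the schema by hand after reading:
{code:python}
# Untested sketch: disable the validation step that triggers the crash,
# then compare schemas manually after reading.
pq_dataset = pq.ParquetDataset(file_groups[0], validate_schema=False)
table = pq_dataset.read()
assert table.schema.equals(schema)  # manual replacement for the validation
{code}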
Is this a problem in the schema builder or in the ParquetDataset object?
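If it is the latter, one possible fix (a hypothetical sketch of what the
validation code could do, not the actual implementation) would be to accept
both schema flavors:
{code:python}
# Hypothetical sketch, not the current implementation: accept either a
# parquet-flavored schema (which has to_arrow_schema) or a plain
# pyarrow.lib.Schema passed in by the user.
if hasattr(self.schema, "to_arrow_schema"):
    dataset_schema = self.schema.to_arrow_schema()
else:
    dataset_schema = self.schema  # already an arrow schema
{code}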
> Unable to read a ParquetDataset when schema validation is on.
> -------------------------------------------------------------
>
> Key: ARROW-7727
> URL: https://issues.apache.org/jira/browse/ARROW-7727
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: _libgcc_mutex 0.1 main
> arrow-cpp 0.15.1 py37h982ac2c_6 conda-forge
> attrs 19.3.0 py_0 conda-forge
> backcall 0.1.0 py_0 conda-forge
> bleach 3.1.0 py_0 conda-forge
> boost-cpp 1.70.0 h8e57a91_2 conda-forge
> brotli 1.0.7 he1b5a44_1000 conda-forge
> bzip2 1.0.8 h516909a_2 conda-forge
> c-ares 1.15.0 h516909a_1001 conda-forge
> ca-certificates 2019.11.28 hecc5488_0 conda-forge
> certifi 2019.11.28 py37_0 conda-forge
> decorator 4.4.1 py_0 conda-forge
> defusedxml 0.6.0 py_0 conda-forge
> double-conversion 3.1.5 he1b5a44_2 conda-forge
> entrypoints 0.3 py37_1000 conda-forge
> gflags 2.2.2 he1b5a44_1002 conda-forge
> glog 0.4.0 he1b5a44_1 conda-forge
> grpc-cpp 1.25.0 h213be95_2 conda-forge
> icu 64.2 he1b5a44_1 conda-forge
> importlib_metadata 1.4.0 py37_0 conda-forge
> inflect 4.0.0 py37_1 conda-forge
> ipykernel 5.1.4 py37h5ca1d4c_0 conda-forge
> ipython 7.11.1 py37h5ca1d4c_0 conda-forge
> ipython_genutils 0.2.0 py_1 conda-forge
> jaraco.itertools 5.0.0 py_0 conda-forge
> jedi 0.16.0 py37_0 conda-forge
> jinja2 2.10.3 py_0 conda-forge
> jsonschema 3.2.0 py37_0 conda-forge
> jupyter_client 5.3.4 py37_1 conda-forge
> jupyter_core 4.6.1 py37_0 conda-forge
> ld_impl_linux-64 2.33.1 h53a641e_7
> libblas 3.8.0 14_openblas conda-forge
> libcblas 3.8.0 14_openblas conda-forge
> libedit 3.1.20181209 hc058e9b_0
> libevent 2.1.10 h72c5cf5_0 conda-forge
> libffi 3.2.1 hd88cf55_4
> libgcc-ng 9.1.0 hdf63c60_0
> libgfortran-ng 7.3.0 hdf63c60_4 conda-forge
> liblapack 3.8.0 14_openblas conda-forge
> libopenblas 0.3.7 h5ec1e0e_6 conda-forge
> libprotobuf 3.11.0 h8b12597_0 conda-forge
> libsodium 1.0.17 h516909a_0 conda-forge
> libstdcxx-ng 9.1.0 hdf63c60_0
> lz4-c 1.8.3 he1b5a44_1001 conda-forge
> markupsafe 1.1.1 py37h516909a_0 conda-forge
> mistune 0.8.4 py37h516909a_1000 conda-forge
> more-itertools 8.1.0 py_0 conda-forge
> nbconvert 5.6.1 py37_0 conda-forge
> nbformat 5.0.4 py_0 conda-forge
> ncurses 6.1 he6710b0_1
> notebook 6.0.3 py37_0 conda-forge
> numpy 1.17.5 py37h95a1406_0 conda-forge
> openssl 1.1.1d h516909a_0 conda-forge
> pandas 0.25.3 py37hb3f55d8_0 conda-forge
> pandoc 2.9.1.1 0 conda-forge
> pandocfilters 1.4.2 py_1 conda-forge
> parquet-cpp 1.5.1 2 conda-forge
> parso 0.6.0 py_0 conda-forge
> pexpect 4.8.0 py37_0 conda-forge
> pickleshare 0.7.5 py37_1000 conda-forge
> pip 20.0.2 py37_0
> prometheus_client 0.7.1 py_0 conda-forge
> prompt_toolkit 3.0.2 py_0 conda-forge
> ptyprocess 0.6.0 py_1001 conda-forge
> pyarrow 0.15.1 py37h8b68381_1 conda-forge
> pygments 2.5.2 py_0 conda-forge
> pyrsistent 0.15.7 py37h516909a_0 conda-forge
> python 3.7.6 h0371630_2
> python-dateutil 2.8.1 py_0 conda-forge
> pytz 2019.3 py_0 conda-forge
> pyzmq 18.1.1 py37h1768529_0 conda-forge
> re2 2020.01.01 he1b5a44_0 conda-forge
> readline 7.0 h7b6447c_5
> send2trash 1.5.0 py_0 conda-forge
> setuptools 45.1.0 py37_0
> six 1.14.0 py37_0 conda-forge
> snappy 1.1.7 he1b5a44_1003 conda-forge
> sqlite 3.30.1 h7b6447c_0
> terminado 0.8.3 py37_0 conda-forge
> testpath 0.4.4 py_0 conda-forge
> thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
> tk 8.6.8 hbc83047_0
> tornado 6.0.3 py37h516909a_0 conda-forge
> traitlets 4.3.3 py37_0 conda-forge
> uriparser 0.9.3 he1b5a44_1 conda-forge
> wcwidth 0.1.8 py_0 conda-forge
> webencodings 0.5.1 py_1 conda-forge
> wheel 0.33.6 py37_0
> xz 5.2.4 h14c3975_4
> zeromq 4.3.2 he1b5a44_2 conda-forge
> zipp 2.1.0 py_0 conda-forge
> zlib 1.2.11 h7b6447c_3
> zstd 1.4.4 h3b9ef0a_1 conda-forge
> Reporter: Otávio Vasques
> Priority: Major
> Fix For: 0.16.0
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)