[ 
https://issues.apache.org/jira/browse/ARROW-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otávio Vasques updated ARROW-7727:
----------------------------------
    Description: 
I was trying to read a subset of my Parquet files using the ParquetDataset 
object with a predefined schema. When the dataset tries to validate the schema, 
it calls `to_arrow_schema`, and the schema object does not support this. I don't 
know what is happening; here is a sample.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

schema = pa.schema([
    pa.field("field1", pa.string()),
    pa.field("field2", pa.string()),
    pa.field("field3", pa.string()),
])

 ...

pq_dataset = pq.ParquetDataset(file_groups[0], schema=schema)

AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'
{code}
If we check the type of the schema as defined above, we get:
{code:python}
type(schema)
pyarrow.lib.Schema
{code}
But the required type according to the docs is `pyarrow.parquet.Schema`, and I 
don't know how to produce such an object, since we are forbidden from using the 
Schema constructor directly.

If we check the implementation on GitHub, the failure comes directly from this line 
[here|https://github.com/apache/arrow/blob/apache-arrow-0.15.1/python/pyarrow/parquet.py#L1097]:
{code:python}
dataset_schema = self.schema.to_arrow_schema()
{code}
Is this a problem in the schema builder or in the ParquetDataset object?
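If the intention is for ParquetDataset to also accept a plain `pyarrow.lib.Schema` there, I imagine the call could be guarded along these lines (just my guess at a possible fix, not an actual patch):
{code:python}
import pyarrow as pa

# Sketch only, inside ParquetDataset around the line linked above:
# accept both schema flavours instead of assuming a ParquetSchema.
if isinstance(self.schema, pa.Schema):
    # Already an Arrow schema; no conversion needed.
    dataset_schema = self.schema
else:
    # A ParquetSchema from pyarrow.parquet; convert explicitly.
    dataset_schema = self.schema.to_arrow_schema()
{code}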


> Unable to read a ParquetDataset when schema validation is on.
> -------------------------------------------------------------
>
>                 Key: ARROW-7727
>                 URL: https://issues.apache.org/jira/browse/ARROW-7727
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: _libgcc_mutex             0.1                        main
> arrow-cpp                 0.15.1           py37h982ac2c_6    conda-forge
> attrs                     19.3.0                     py_0    conda-forge
> backcall                  0.1.0                      py_0    conda-forge
> bleach                    3.1.0                      py_0    conda-forge
> boost-cpp                 1.70.0               h8e57a91_2    conda-forge
> brotli                    1.0.7             he1b5a44_1000    conda-forge
> bzip2                     1.0.8                h516909a_2    conda-forge
> c-ares                    1.15.0            h516909a_1001    conda-forge
> ca-certificates           2019.11.28           hecc5488_0    conda-forge
> certifi                   2019.11.28               py37_0    conda-forge
> decorator                 4.4.1                      py_0    conda-forge
> defusedxml                0.6.0                      py_0    conda-forge
> double-conversion         3.1.5                he1b5a44_2    conda-forge
> entrypoints               0.3                   py37_1000    conda-forge
> gflags                    2.2.2             he1b5a44_1002    conda-forge
> glog                      0.4.0                he1b5a44_1    conda-forge
> grpc-cpp                  1.25.0               h213be95_2    conda-forge
> icu                       64.2                 he1b5a44_1    conda-forge
> importlib_metadata        1.4.0                    py37_0    conda-forge
> inflect                   4.0.0                    py37_1    conda-forge
> ipykernel                 5.1.4            py37h5ca1d4c_0    conda-forge
> ipython                   7.11.1           py37h5ca1d4c_0    conda-forge
> ipython_genutils          0.2.0                      py_1    conda-forge
> jaraco.itertools          5.0.0                      py_0    conda-forge
> jedi                      0.16.0                   py37_0    conda-forge
> jinja2                    2.10.3                     py_0    conda-forge
> jsonschema                3.2.0                    py37_0    conda-forge
> jupyter_client            5.3.4                    py37_1    conda-forge
> jupyter_core              4.6.1                    py37_0    conda-forge
> ld_impl_linux-64          2.33.1               h53a641e_7  
> libblas                   3.8.0               14_openblas    conda-forge
> libcblas                  3.8.0               14_openblas    conda-forge
> libedit                   3.1.20181209         hc058e9b_0  
> libevent                  2.1.10               h72c5cf5_0    conda-forge
> libffi                    3.2.1                hd88cf55_4  
> libgcc-ng                 9.1.0                hdf63c60_0  
> libgfortran-ng            7.3.0                hdf63c60_4    conda-forge
> liblapack                 3.8.0               14_openblas    conda-forge
> libopenblas               0.3.7                h5ec1e0e_6    conda-forge
> libprotobuf               3.11.0               h8b12597_0    conda-forge
> libsodium                 1.0.17               h516909a_0    conda-forge
> libstdcxx-ng              9.1.0                hdf63c60_0  
> lz4-c                     1.8.3             he1b5a44_1001    conda-forge
> markupsafe                1.1.1            py37h516909a_0    conda-forge
> mistune                   0.8.4           py37h516909a_1000    conda-forge
> more-itertools            8.1.0                      py_0    conda-forge
> nbconvert                 5.6.1                    py37_0    conda-forge
> nbformat                  5.0.4                      py_0    conda-forge
> ncurses                   6.1                  he6710b0_1  
> notebook                  6.0.3                    py37_0    conda-forge
> numpy                     1.17.5           py37h95a1406_0    conda-forge
> openssl                   1.1.1d               h516909a_0    conda-forge
> pandas                    0.25.3           py37hb3f55d8_0    conda-forge
> pandoc                    2.9.1.1                       0    conda-forge
> pandocfilters             1.4.2                      py_1    conda-forge
> parquet-cpp               1.5.1                         2    conda-forge
> parso                     0.6.0                      py_0    conda-forge
> pexpect                   4.8.0                    py37_0    conda-forge
> pickleshare               0.7.5                 py37_1000    conda-forge
> pip                       20.0.2                   py37_0  
> prometheus_client         0.7.1                      py_0    conda-forge
> prompt_toolkit            3.0.2                      py_0    conda-forge
> ptyprocess                0.6.0                   py_1001    conda-forge
> pyarrow                   0.15.1           py37h8b68381_1    conda-forge
> pygments                  2.5.2                      py_0    conda-forge
> pyrsistent                0.15.7           py37h516909a_0    conda-forge
> python                    3.7.6                h0371630_2  
> python-dateutil           2.8.1                      py_0    conda-forge
> pytz                      2019.3                     py_0    conda-forge
> pyzmq                     18.1.1           py37h1768529_0    conda-forge
> re2                       2020.01.01           he1b5a44_0    conda-forge
> readline                  7.0                  h7b6447c_5  
> send2trash                1.5.0                      py_0    conda-forge
> setuptools                45.1.0                   py37_0  
> six                       1.14.0                   py37_0    conda-forge
> snappy                    1.1.7             he1b5a44_1003    conda-forge
> sqlite                    3.30.1               h7b6447c_0  
> terminado                 0.8.3                    py37_0    conda-forge
> testpath                  0.4.4                      py_0    conda-forge
> thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
> tk                        8.6.8                hbc83047_0  
> tornado                   6.0.3            py37h516909a_0    conda-forge
> traitlets                 4.3.3                    py37_0    conda-forge
> uriparser                 0.9.3                he1b5a44_1    conda-forge
> wcwidth                   0.1.8                      py_0    conda-forge
> webencodings              0.5.1                      py_1    conda-forge
> wheel                     0.33.6                   py37_0  
> xz                        5.2.4                h14c3975_4  
> zeromq                    4.3.2                he1b5a44_2    conda-forge
> zipp                      2.1.0                      py_0    conda-forge
> zlib                      1.2.11               h7b6447c_3  
> zstd                      1.4.4                h3b9ef0a_1    conda-forge
>            Reporter: Otávio Vasques
>            Priority: Major
>             Fix For: 0.16.0
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
