[jira] [Commented] (ARROW-11473) [Python] Needs a handling for missing columns while reading parquet file

Alenka Frim (Jira) Thu, 16 Dec 2021 05:53:20 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460722#comment-17460722
 ]


Alenka Frim commented on ARROW-11473:
-------------------------------------

Hi [~jasonkhadka], sorry for such a late reply.

What you could do is use the column names from the metadata of the parquet file 
to get the subset of columns you want to read. Using your example, I did it like so:
{code:python}
import pyarrow.parquet as pq
import pandas as pd

read_columns = ['a','X'] 
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})

file_name = '/tmp/my_df.pq'

df.to_parquet(file_name)
m = pq.read_metadata(file_name) # reads only the metadata

# Get the column names from the schema
df_columns = m.schema.names
# Keep only the requested names that exist in the file, preserving their order
columns = [c for c in read_columns if c in df_columns]

pd.read_parquet(file_name, columns=columns)
{code}
 
The output:
{code:python}
   a
0  1
1  2
2  3
{code}
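
If you need this in more than one place, the approach above can be wrapped in a small helper. This is only a sketch: {{split_columns}} and {{read_parquet_existing}} are names made up for illustration and are not part of pyarrow or pandas. It assumes the file is a local path and uses the same {{pq.read_metadata}} call as above. It also returns the names that were skipped, so the caller can log them or raise.

```python
from typing import Iterable, List, Tuple


def split_columns(
    requested: Iterable[str], available: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split requested names into (present, missing), keeping the requested order."""
    requested = list(requested)
    available_set = set(available)
    present = [name for name in requested if name in available_set]
    missing = [name for name in requested if name not in available_set]
    return present, missing


def read_parquet_existing(path, requested):
    """Hypothetical helper: read only the requested columns that exist in the file.

    Returns the DataFrame together with the list of requested names
    that were not found in the file's schema.
    """
    # Imported here so that split_columns stays usable without pandas/pyarrow.
    import pandas as pd
    import pyarrow.parquet as pq

    # Reads only the file metadata, not the column data.
    file_columns = pq.read_metadata(path).schema.names
    present, missing = split_columns(requested, file_columns)
    return pd.read_parquet(path, columns=present), missing
```

With the file from the example, {{read_parquet_existing(file_name, ['a', 'X'])}} should give back the data frame with only column {{a}}, plus {{['X']}} as the list of missing names.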

> [Python] Needs a handling for missing columns while reading parquet file 
> -------------------------------------------------------------------------
>
>                 Key: ARROW-11473
>                 URL: https://issues.apache.org/jira/browse/ARROW-11473
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: jason khadka
>            Priority: Major
>
> Currently there is no way to handle the error raised by missing columns in 
> a parquet file.
> If a requested column is missing, it just raises an ArrowInvalid error:
> {code:python}
> columns = ['item1', 'item2', 'item3']  # 'item3' is not in the parquet file
> pd.read_parquet(file_name, columns=columns)
> # ArrowInvalid: Field named 'item3' not found or not unique in the schema.
> {code}
> There is no way to handle this. The ArrowInvalid also does not carry any 
> information that exposes the field name, so that the field could be 
> ignored on the next try.
> Example:
> {code:python}
> import pandas as pd
> from pyarrow.lib import ArrowInvalid
>
> read_columns = ['a', 'b', 'X']
> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
> file_name = '/tmp/my_df.pq'
> df.to_parquet(file_name)
> try:
>     df = pd.read_parquet(file_name, columns=read_columns)
> except ArrowInvalid as e:
>     inval = e  # keep a reference; 'e' is cleared when the except block ends
> print(inval.args)
> # ("Field named 'X' not found or not unique in the schema.",)
> {code}
>  
> You could parse the message above to get 'X', but that is a rather 
> cumbersome solution. It would be great if the error carried the field name, 
> so that you could do, for example:
>
> {code:python}
> inval.field
> # 'X'
> {code}
> Or a better feature would be to have error handling in read_table of 
> pyarrow, where something like {{error='ignore'}} could be passed. This would 
> then ignore the missing column by checking the schema.
> For example, in the case above:
> {code:python}
> df = pd.read_parquet(file_name, columns=read_columns, error='ignore')
> {code}
> This would ignore the missing column 'X'.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
