[ https://issues.apache.org/jira/browse/ARROW-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-11473:
------------------------------------
    Summary: [Python] Needs a handling for missing columns while reading parquet file  (was: Needs a handling for missing columns while reading parquet file )

> [Python] Needs a handling for missing columns while reading parquet file
> -------------------------------------------------------------------------
>
>                 Key: ARROW-11473
>                 URL: https://issues.apache.org/jira/browse/ARROW-11473
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: jason khadka
>            Priority: Major
>
> Currently there is no way to handle the error raised when a requested column
> is missing from a parquet file.
> If a requested column is missing, the read simply raises an ArrowInvalid error:
> {code:python}
> columns = ['item1', 'item2', 'item3']  # 'item3' is not in the parquet file
> pd.read_parquet(file_name, columns=columns)
>
> # ArrowInvalid: Field named 'item3' not found or not unique in the schema.
> {code}
> There is no clean way to recover from this. The ArrowInvalid exception does
> not expose the field name as structured data, so the caller cannot easily
> drop the missing field and retry.
> Example:
> {code:python}
> import pandas as pd
> from pyarrow.lib import ArrowInvalid
>
> read_columns = ['a', 'b', 'X']
> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
> file_name = '/tmp/my_df.pq'
> df.to_parquet(file_name)
>
> try:
>     df = pd.read_parquet(file_name, columns=read_columns)
> except ArrowInvalid as e:
>     inval = e
>
> print(inval.args)
> # ("Field named 'X' not found or not unique in the schema.",)
> {code}
>
> You could parse the message above to get 'X', but that is a somewhat hacky
> solution. It would be great if the exception carried the field name as an
> attribute, so that you could do, for example:
>
> {code:python}
> inval.field
> # 'X'
> {code}
> A better feature would be an error-handling option in pyarrow's read_table,
> where something like {{error='ignore'}} could be passed. The reader would
> then check the schema and skip any missing column.
> For example, in the case above:
> {code:python}
> df = pd.read_parquet(file_name, columns=read_columns, error='ignore')
> {code}
> would ignore the missing column 'X'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
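
Until something like the requested option exists, the "check the schema first" idea from the issue can be approximated on the caller's side. The following is only a minimal sketch, not part of pyarrow or pandas: the helper names `split_columns` and `read_parquet_ignore_missing` are hypothetical, and it assumes the file is a single local parquet file whose top-level field names are what `columns` refers to.

```python
from typing import Iterable, List, Tuple


def split_columns(
    requested: Iterable[str], available: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split requested names into (present, missing), keeping request order."""
    available_set = set(available)
    present: List[str] = []
    missing: List[str] = []
    for name in requested:
        if name in available_set:
            present.append(name)
        else:
            missing.append(name)
    return present, missing


def read_parquet_ignore_missing(path, columns):
    """Hypothetical helper: read only the requested columns that exist.

    Returns the DataFrame together with the list of skipped column names,
    so the caller can still log or react to them.
    """
    # Imported lazily so split_columns stays usable without these packages.
    import pandas as pd
    import pyarrow.parquet as pq

    # read_schema inspects the file metadata only; no column data is loaded.
    schema = pq.read_schema(path)
    present, missing = split_columns(columns, schema.names)
    df = pd.read_parquet(path, columns=present)
    return df, missing
```

With the file from the example above, `read_parquet_ignore_missing(file_name, ['a', 'b', 'X'])` would be expected to return a frame with columns 'a' and 'b' and report 'X' as skipped. Returning the missing names rather than silently dropping them keeps the caller informed, which a plain `error='ignore'` flag would not do.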