[
https://issues.apache.org/jira/browse/ARROW-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461572#comment-17461572
]
Will Jones commented on ARROW-11473:
------------------------------------
Hi Jason,
If you know the schema ahead of time (it seems like you are expecting a certain
column), the pyarrow.dataset module might be useful to you. Any column that is
in the schema but missing from the file is populated with nulls.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
read_columns = ['a','b','X']
df = pa.table({'a': pa.array([1, 2, 3]), 'b': pa.array(['foo', 'bar', 'jar'])})
file_name = '/tmp/my_df.pq'
pq.write_table(df, file_name)
my_schema = pa.schema([
    pa.field("a", pa.int64()),
    pa.field("b", pa.utf8()),
    pa.field("X", pa.utf8())
])
dataset = ds.dataset(file_name, format="parquet", schema=my_schema)
df = dataset.to_table()
print(df)
# pyarrow.Table
# a: int64
# b: string
# X: string
# ----
# a: [[1,2,3]]
# b: [["foo","bar","jar"]]
# X: [[null,null,null]]
{code}
> [Python] Needs a handling for missing columns while reading parquet file
> -------------------------------------------------------------------------
>
> Key: ARROW-11473
> URL: https://issues.apache.org/jira/browse/ARROW-11473
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: jason khadka
> Priority: Major
>
> Currently there is no way to handle the error raised when a requested column
> is missing from a parquet file.
> If a requested column is missing, the read just raises an ArrowInvalid error:
> {code:python}
> columns = ['item1', 'item2', 'item3']  # item3 is not in the parquet file
> pd.read_parquet(file_name, columns=columns)
> # ArrowInvalid: Field named 'item3' not found or not unique in the schema.
> {code}
> There is no way to handle this. The ArrowInvalid exception also does not
> carry the name of the missing field, so the caller cannot simply drop that
> field and retry.
> Example :
> {code:python}
> import pandas as pd
> from pyarrow.lib import ArrowInvalid
>
> read_columns = ['a','b','X']
> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
> file_name = '/tmp/my_df.pq'
> df.to_parquet(file_name)
> try:
>     df = pd.read_parquet(file_name, columns=read_columns)
> except ArrowInvalid as e:
>     inval = e  # keep a reference; 'e' is cleared when the except block ends
> print(inval.args)
> # ("Field named 'X' not found or not unique in the schema.",)
> {code}
>
> You could parse the message above to get 'X', but that is a fragile
> workaround. It would be great if the exception itself carried the field name,
> so that you could do, for example:
>
> {code:python}
> inval.field
> # 'X'
> {code}
> An even better feature would be an error-handling option in pyarrow's
> read_table, where something like {{error='ignore'}} could be passed. This
> would then ignore the missing column by checking the schema.
> For example, in the case above:
> {code:python}
> df = pd.read_parquet(file_name, columns=read_columns, error='ignore')
> {code}
> would ignore the missing column 'X'.