[GitHub] [arrow] westonpace commented on issue #9194: Needs a handling for missing columns in parquet file

GitBox Fri, 15 Jan 2021 09:03:49 -0800


westonpace commented on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-761062704



   Do you know the data type of the missing column?  If so, you can use the 
datasets API to read the table.  The datasets API accepts an expected schema 
listing every column that might be requested; columns absent from a given 
file come back as nulls.  This supports dataset evolution, where a master 
schema covers a collection of files but individual files may lack some of 
the columns.
   
   ```
   import pandas as pd
   import pyarrow as pa
   import pyarrow.dataset as pads
   
   read_columns = ['a','b','X']
   
   df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
   file_name = '/tmp/my_df.pq'
   df.to_parquet(file_name)
   
   schema = pa.schema([
       ('a', pa.int64()),
       ('b', pa.string()),
       ('X', pa.int32())
   ])
   
   # df = pd.read_parquet(file_name, columns = read_columns)
   
   ds = pads.dataset([file_name], schema=schema)
   table = ds.to_table()
   print(table)
   print(table.column('X').to_pylist())
   ```


