wjones127 opened a new issue, #5950:
URL: https://github.com/apache/arrow-datafusion/issues/5950
### Is your feature request related to a problem or challenge?
When scanning Parquet files, we'd often like to provide an expected schema,
since:
1. The Parquet files might not all have an identical physical schema, but we
may know the unified schema up front, such as when we are using a table format
like Delta Lake.
2. A given Parquet type can map to many different Arrow types. For example,
when possible it would be nice to read a string column as a DictionaryArray,
especially if the data is already dictionary-encoded.
### Describe the solution you'd like
It would be nice to be able to provide an Arrow schema in
`ParquetScanOptions`, and then the scan would try in order:
1. Read the Parquet data directly into the output type
2. Read the Parquet data into a supported Arrow type then cast
3. Return an error stating which types can't be mapped
This would probably require some changes upstream in the parquet crate, but
for at least some of the functionality, DataFusion seems like the right place.
LMK if you think differently.
### Describe alternatives you've considered
Our current issue is that our expected schema (from the Delta Lake log)
doesn't match the physical schema (at least when the files were written by
Spark). As a workaround, we are looking at the metadata of one of the Parquet
files. This partially solves the issue, but will likely fail for tables where
Spark wasn't the only engine to write to the table.
### Additional context
Here's a simple example where PyArrow's scanner works but DataFusion's
doesn't seem to:
```python
import pyarrow as pa
import pyarrow.parquet as pq
tab1 = pa.table({"a": pa.array([1, 2, 3, 4, 5], type=pa.int32())})
tab2 = pa.table({"a": pa.array([6, 7, 8, 9, 10], type=pa.int64())})
file1 = 'table/1.parquet'
file2 = 'table/2.parquet'
pq.write_table(tab1, file1)
pq.write_table(tab2, file2)
```
```python
import pyarrow.dataset as ds
ds.dataset("table", format="parquet").to_table()
```
```
pyarrow.Table
a: int32
----
a: [[1,2,3,4,5],[6,7,8,9,10]]
```
```python
from datafusion import SessionContext
# Create a DataFusion context
ctx = SessionContext()
# Register table with context
ctx.register_parquet('numbers', 'table')
```
```
Exception: DataFusion error: ArrowError(SchemaError("Fail to merge schema
field 'a' because the from data_type = Int64 does not equal Int32"))
```