mattaubury opened a new issue, #39670:
URL: https://github.com/apache/arrow/issues/39670

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Imagine we have a couple of Parquet files with non-nullable columns with 
different names:
   ```
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   
   x = pa.Table.from_arrays(
       [pa.array([1, 2, 3])],
       schema=pa.schema([pa.field("x", pa.int32(), nullable=False)]),
   )
   y = pa.Table.from_arrays(
       [pa.array([1, 2, 3])],
       schema=pa.schema([pa.field("y", pa.int32(), nullable=False)]),
   )
   
   pq.write_table(x, "x.parquet")
   pq.write_table(y, "y.parquet")
   ```
   We can read these with `Dataset` if we provide it with the combined schema:
   ```
   schema = pa.unify_schemas([x.schema, y.schema])
   table = ds.dataset(["x.parquet", "y.parquet"], schema=schema).to_table()
   
   print(table)
   ```
   This gives a curious result: a table whose columns are declared non-nullable but contain nulls:
   ```
   pyarrow.Table
   x: int32 not null
   y: int32 not null
   ----
   x: [[1,2,3],[null,null,null]]
   y: [[null,null,null],[1,2,3]]
   ```
   As a consequence, casting the table to its own schema fails, which I don't think should ever happen:
   ```
   >>> table.cast(table.schema)
   ...
   ValueError: Casting field 'x' with null values to non-nullable
   ```
   The problem also appears when no schema is provided: `Dataset` then uses the schema from `x.parquet`, so the output is:
   ```
   pyarrow.Table
   x: int32 not null
   ----
   x: [[1,2,3],[null,null,null]]
   ```
   Ideally, both of the following would happen:
   - `Dataset` would notice when it adds null chunks and fix up the schema of the returned table
   - `pa.unify_schemas()` would promote non-nullable fields that are missing from some of the input schemas to nullable
   
   ### Component(s)
   
   C++, Parquet, Python

