westonpace commented on PR #35860:
URL: https://github.com/apache/arrow/pull/35860#issuecomment-1571343210

   > The following do not:
   > 
   > pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable, format="parquet")
   > 
   > or
   > 
   > pa.dataset.write_dataset([table, table_no_null], tempdir/"nulltest2", schema=schema_nullable, format="parquet")
   > 
   
   These lines failed for me with the following error:
   
   ```
   pyarrow/dataset.py:936: in write_dataset
       data = InMemoryDataset(data, schema=schema)
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   raise ArrowTypeError(
   E   pyarrow.lib.ArrowTypeError: Item has schema
   E   x: int64
   E   y: int64
   E   which does not match expected schema
   E   x: int64 not null
   E   y: int64
   ```
   
   I thought this was supported, and it took me a moment to track down what was 
going on.  The error is actually raised before the C++ call that writes the 
dataset.  Pyarrow takes the two inputs (`table`, `table_no_null`), puts them 
in an `InMemoryDataset`, and passes the explicit schema along.  The 
constructor for `InMemoryDataset` verifies that every table it is given has 
the dataset's schema, and it raises this error because one of the tables has 
a schema that does not match.
   
   If this is the same error you were getting, then I think we can call this an 
invalid scenario that we don't have to support, at least for this PR.  
Arguably, you could evolve a table into the correct schema when adding it to an 
`InMemoryDataset`, but that's a different feature.
   
   This is kind of confusing because @anjakefala and I were testing earlier, and 
you are allowed to create an `InMemoryDataset` from tables / batches that have 
the same types / nullability but different field metadata.  So I created an 
additional Python test case for field metadata, which verifies that the "two 
tables but mixed metadata can be overridden by an explicit schema" call works.

