[GitHub] [arrow] paolosartiprom commented on issue #37583: [Python] pyarrow.dataset.write_dataset doesn't write empty datasets, even if they had a schema, thus losing it when reading back the dataset

via GitHub Thu, 21 Sep 2023 02:42:11 -0700


paolosartiprom commented on issue #37583:
URL: https://github.com/apache/arrow/issues/37583#issuecomment-1729222896


   I think the change would be breaking for users who detect an empty dataset 
only by listing the directory... If pyarrow always wrote a parquet file, the 
loaded dataset would still be empty, just with its schema and metadata 
preserved.
   
   Using a separate metadata/schema file is interesting, but I don't know 
whether it is supported anywhere else in this library. Is it? It's also 
unnecessary when the data actually has rows. Moreover, I'd guess that other 
parquet readers will always understand a parquet file with 0 rows, but not a 
schema file?
   
   I think that writing an empty parquet table if the dataset has 0 rows would 
also be consistent with what pyarrow.parquet.write_table does.
   
   Anyway, I agree that it might break some code that assumes the current (in 
my opinion wrong) behaviour.
   
   I think the best option would be to fix this in a major release and 
document the new behaviour in the release notes, ideally with an option to 
opt out and keep the legacy (current) behaviour.
   
   Otherwise, if a breaking change is not an option, a flag to opt in to this 
behaviour would be nice if possible.
   Maybe something like: preserve_schema_if_empty? keep_schema_if_empty? I 
guess I would always pass this as true, but maybe that just reflects the 
current needs of the project I'm on.
   
   What do you think? @mapleFU 
   
   Thank you very much


