wiedld opened a new issue, #6177:
URL: https://github.com/apache/arrow-rs/issues/6177

   **Is your feature request related to a ~problem or~ challenge? Please 
describe what you are trying to do.**
   
   We have been using at least two parquet writers that both utilize the 
low-level APIs provided by the parquet crate (e.g. 
[SerializedFileWriter](https://docs.rs/parquet/52.2.0/parquet/file/writer/struct.SerializedFileWriter.html)).
 One of the writers (ArrowWriter) is provided as part of the parquet crate, 
whereas the other parallel writer (datafusion's ParquetSink) is not. However, 
in both cases we later attempt to read these files using the parquet crate's 
readers.
   
   The challenge is that we keep encountering unexpected differences in the 
parquet written by these two writers. The most recent example is that the 
[arrow schema is missing when using datafusion's parallel 
writer](https://github.com/apache/datafusion/issues/11770), whereas it is 
[included in the 
ArrowWriter](https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_writer/mod.rs#L188-L190)
 on parquet write.
   
   
   **Describe the solution you'd like**
   
   Can we update the lower level APIs (in the parquet crate) to make it easier 
for users to create their own parquet writers -- without encountering surprise 
differences from the behavior of parquet's ArrowWriter? Provide better 
documentation? Provide guidance for testing when creating your own parquet 
writer?
   
   
   **Describe alternatives you've considered**
   
   Alternatively, we could treat this problem as the responsibility of the 
users who create their own writers. We already plan to file a datafusion 
ticket proposing integration tests that ensure byte equivalency of 
the output parquet (vs parquet written by ArrowWriter).
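
   As a rough illustration of what such a test could check, here is a minimal sketch that compares the footer key/value metadata of two files. It is a narrower check than full byte equivalency, and it assumes the key/value pairs have already been read from each file's footer by whatever reader the test uses. The `FooterKv` alias and the `diff_footer_kv` helper are hypothetical names invented for this sketch, not part of the parquet crate. `ARROW:schema` is the metadata key under which the arrow schema is stored.

```rust
use std::collections::BTreeMap;

/// Footer key/value metadata, modeled as key -> optional value.
/// (Hypothetical alias for this sketch; not a parquet crate type.)
type FooterKv = BTreeMap<String, Option<String>>;

/// Returns, in sorted order, every key that is missing from one side
/// or whose value differs between the two footers.
fn diff_footer_kv(reference: &FooterKv, candidate: &FooterKv) -> Vec<String> {
    let mut differing = Vec::new();

    // Keys in the reference that are absent or different in the candidate.
    for (key, value) in reference {
        if candidate.get(key) != Some(value) {
            differing.push(key.clone());
        }
    }

    // Keys that only the candidate has.
    for key in candidate.keys() {
        if !reference.contains_key(key) {
            differing.push(key.clone());
        }
    }

    differing.sort();
    differing
}

fn main() {
    // Stand-in for metadata read from a file written by ArrowWriter.
    let mut arrow_writer_kv = FooterKv::new();
    arrow_writer_kv.insert(
        "ARROW:schema".to_string(),
        Some("<encoded schema placeholder>".to_string()),
    );

    // Stand-in for metadata read from a file written by a custom writer
    // that did not embed the arrow schema.
    let custom_writer_kv = FooterKv::new();

    let diff = diff_footer_kv(&arrow_writer_kv, &custom_writer_kv);
    println!("differing footer keys: {:?}", diff);
}
```

   A test built this way would fail as soon as a custom writer drops or changes a footer key that ArrowWriter emits, which is the kind of difference described above.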
   
   **Additional context**
   
   This is not the first time that we have discovered differences in the output 
parquet from the ArrowWriter vs datafusion's ParquetSink. However, we are 
unclear on how best to divide the responsibility between more testing by the 
users who write their own writers and API design changes in the parquet crate.
   
   

