jorisvandenbossche commented on issue #39489:
URL: https://github.com/apache/arrow/issues/39489#issuecomment-1888764091

   Comment from @pitrou on the PR at 
https://github.com/apache/arrow/pull/39491#issuecomment-1884825697:
   
   > > Nowadays it is maybe OK to make a different trade-off, and to prioritize 
the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow 
legacy files, given that it is about files written by an Arrow version of 
several years old (Arrow < 0.15).
   > 
   > If we really care about this, we should perhaps detect the offending Arrow 
version, since the Parquet file records the producer name and version.
   
   We could indeed infer if the file was written by Arrow (the version 
shouldn't matter, I think, because in the end we care about whether the 
LogicalType is present or not, which is always the case starting from a certain 
version). 
   I don't know if that wouldn't make things more complex / confusing for users 
though, that a similar file (just by a different producer) would give different 
results.
   
   > But we would also need an actual file to test with.
   
   Above (https://github.com/apache/arrow/issues/39489#issuecomment-1884416205) 
I posted a small example, and updated that comment to attach the generated file 
that I used. 
   We do actually have a file in the pyarrow tests that covers this 
([python/pyarrow/tests/data/parquet/v0.7.1.column-metadata-handling.parquet](https://github.com/apache/arrow/blob/main/python/pyarrow/tests/data/parquet/v0.7.1.column-metadata-handling.parquet)),
 but the test didn't catch that, because it only tested roundtrip to/from 
pandas, and for pandas it was using additional _pandas_ metadata we did already 
store in the schema at the time, and so that still correctly roundtrips. 
   
   But, yes, in general we have a lack of testing for Parquet on this front: 
https://github.com/apache/arrow/issues/22325
   
   ---
   
   > Would you be in favor of adding a field to `ArrowReaderProperties` to 
control this behavior?
   
   To be honest, I am not sure what would be the best option. Add a way to 
control the behaviour of course make this more flexible for the user/library 
using arrow, but 1) is it still needed to add this given it is for legacy 
files? (I don't know how many producers still generate them), and 2) we still 
need to make a choice of a default in the Arrow libraries. 
   
   Assume we would keep the default in Arrow as it was before, do you have 
applications that would set it differently? (maybe yes, as that might have 
triggered this discussion?)
   
   And again: this whole change is maybe fine nowadays. I mostly wanted to 
point out this was _not_ a bug, but a deliberate design decision in the past, 
and if we change this, we should do it deliberately and flag it as a breaking 
change.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to