jorisvandenbossche commented on issue #39489: URL: https://github.com/apache/arrow/issues/39489#issuecomment-1888764091
Comment from @pitrou on the PR at https://github.com/apache/arrow/pull/39491#issuecomment-1884825697: > > Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15). > > If we really care about this, we should perhaps detect the offending Arrow version, since the Parquet file records the producer name and version. We could indeed infer if the file was written by Arrow (the version shouldn't matter, I think, because in the end we care about whether the LogicalType is present or not, which is always the case starting from a certain version). I don't know if that wouldn't make things more complex / confusing for users though, that a similar file (just by a different producer) would give different results. > But we would also need an actual file to test with. Above (https://github.com/apache/arrow/issues/39489#issuecomment-1884416205) I posted a small example, and updated that comment to attach the generated file that I used. We do actually have a file in the pyarrow tests that covers this ([python/pyarrow/tests/data/parquet/v0.7.1.column-metadata-handling.parquet](https://github.com/apache/arrow/blob/main/python/pyarrow/tests/data/parquet/v0.7.1.column-metadata-handling.parquet)), but the test didn't catch that, because it only tested roundtrip to/from pandas, and for pandas it was using additional _pandas_ metadata we did already store in the schema at the time, and so that still correctly roundtrips. But, yes, in general we have a lack of testing for Parquet on this front: https://github.com/apache/arrow/issues/22325 --- > Would you be in favor of adding a field to `ArrowReaderProperties` to control this behavior? To be honest, I am not sure what would be the best option. Add a way to control the behaviour of course make this more flexible for the user/library using arrow, but 1) is it still needed to add this given it is for legacy files? (I don't know how many producers still generate them), and 2) we still need to make a choice of a default in the Arrow libraries. Assume we would keep the default in Arrow as it was before, do you have applications that would set it differently? (maybe yes, as that might have triggered this discussion?) And again: this whole change is maybe fine nowadays. I mostly wanted to point out this was _not_ a bug, but a deliberate design decision in the past, and if we change this, we should do it deliberately and flag it as a breaking change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org