joellubi commented on issue #39489: URL: https://github.com/apache/arrow/issues/39489#issuecomment-1888869766
> I don't know if that wouldn't make things more complex / confusing for users though, that a similar file (just by a different producer) would give different results.

I agree this could be confusing. For example, data lakes are often written to by many different producers, so inconsistencies when reading could surface in this scenario.

> To be honest, I am not sure what would be the best option. Add a way to control the behaviour of course make this more flexible for the user/library using arrow, but 1) is it still needed to add this given it is for legacy files? (I don't know how many producers still generate them), and 2) we still need to make a choice of a default in the Arrow libraries.

I think that an option to control the behavior would be valuable, but a default that doesn't prompt users to update their files or application logic probably isn't helpful. Do we use warnings or any other way of deprecating certain usage in other parts of Arrow?

> Assume we would keep the default in Arrow as it was before, do you have applications that would set it differently? (maybe yes, as that might have triggered this discussion?)

For what it's worth, I do not. I was having issues in an application using the Go Parquet implementation, which had a bug different from the logic here. As part of updating the Go implementation, I referred to the Parquet format document as well as a few other existing implementations, which is when I noticed the inconsistency here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org