joellubi commented on issue #39489:
URL: https://github.com/apache/arrow/issues/39489#issuecomment-1888869766

   > I don't know if that wouldn't make things more complex / confusing for 
users though, that a similar file (just by a different producer) would give 
different results.
   
   I agree this could be confusing. For example, data lakes are often written 
to by many different producers, so inconsistencies when reading could surface 
in that scenario.
   
   > To be honest, I am not sure what would be the best option. Adding a way to 
control the behaviour of course makes this more flexible for the user/library 
using arrow, but 1) is it still needed to add this given it is for legacy 
files? (I don't know how many producers still generate them), and 2) we still 
need to make a choice of a default in the Arrow libraries.
   
   I think that an option to control the behavior would be valuable, but a 
default that doesn't prompt users to update their files or application logic 
probably isn't helpful. Do we use warnings or any other way of deprecating 
certain usage in other parts of Arrow?
   
   > Assume we would keep the default in Arrow as it was before, do you have 
applications that would set it differently? (maybe yes, as that might have 
triggered this discussion?)
   
   For what it's worth, I do not. I was having issues in an application using 
the Go Parquet implementation, which had a bug different from the logic here. 
As part of updating the Go implementation, I referred to the Parquet format 
document as well as a few other existing implementations, which is when I 
noticed this inconsistency.


