zeroshade commented on PR #455:
URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3161096586

   @rdblue @aihuaxu @emkornfield @wgtmac @cashmand 
   
   I've been thinking about this a bit more over the past few days and had the 
following thoughts. Please let me know what you all think.
   
   > The implementation I generated these cases from is defensive and tries to 
read if it can rather than producing errors. I'd recommend doing the same thing 
to handle outside-of-spec cases.
   
   The problem I see here is multi-faceted. While I understand the idea and 
reasons for doing this, allowing outside-of-spec things on read but not on 
write causes a few problems:
   
   * It explicitly creates scenarios that *cannot* be round-tripped. You can 
read things which don't conform to the spec, sure, but then you have to write a 
spec-conforming result. 
   * There's no incentive for bad writers to fix their issues. If the 
convention becomes that these invalid scenarios are allowed to be read, despite 
the spec not actually saying it should be allowed, then any writers which 
currently write invalid, non-spec-conforming files have no incentive to fix 
their issues. 
   * It significantly increases the complexity of implementing the spec. It 
puts the onus on implementors to deduce that they have to allow these cases 
despite the lack of language in the spec, not to mention the actual difficulty 
of any Arrow integration with Parquet to have to either allow these invalid 
constructions or perform conversions and casts to handle the required/optional 
differences and whether or not specific fields exist.
   
   > For instance, most of the time if a column is missing, most 
implementations will allow you to project a column of nulls. Extending this 
idea to Variant, it's reasonable to assume that a missing value column 
indicates a column of nulls and read accordingly instead of failing. The other 
cases are similar.
   
   This is true for compute engines, but not necessarily true for regular 
Parquet implementations. Generally, the idea of "projection" is an 
*engine-level* concept. Most implementations of parquet will simply error if 
you try to read a column that doesn't exist, letting the level above the file 
interactions decide upon projection.
   
   > The behavior in these cases was debated when we were working on the spec. 
We ultimately decided to disallow writers from producing them, but I think it 
is a best practice to ensure that the read behavior is predictable, accepts 
even slightly malformed cases, and has consistent behavior depending on the 
projected columns.
   
   I agree with this sentiment. We definitely want read behavior to be 
predictable and have consistent behavior. If the desired consistent behavior 
isn't actually written into the spec, how would one know what that behavior 
should be?
   
   > My preference would be to relax the spec for this issue. It doesn't seem 
like there's much benefit to enforcing it on the read side, and it's easy to 
imagine a writer failing to enforce it in some cases where it usually adds a 
typed_value to the schema, but not always.
   
   This touches on my ultimate point honestly. My conclusion ultimately is that 
one of two things should happen:
   
   1. The spec needs to be updated to *explicitly* state what the behavior 
should be in all of these test cases I have identified, and either have the 
wording updated accordingly. (Fields being `present` vs `required`, expressly 
stating whether certain fields *must* be `required`, *must* be `optional` or 
are allowed to be either, and so on).
   2. These test cases that fall outside the scope of the spec should not exist 
in `parquet-testing` as integration tests. The integration tests should only 
test what the spec *actually* states, not conventions of a particular 
implementation. If some behavior is expected to be consistent, then it should 
be written in the spec explicitly so that there's no confusion or ambiguity 
what the behavior is expected to be.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to