zeroshade opened a new pull request, #455:
URL: https://github.com/apache/arrow-go/pull/455

   ### Rationale for this change
   This tests the variant implementation here against parquet-java using the
test cases generated in
https://github.com/apache/parquet-testing/pull/90/files. Overall, it confirms
that our implementation is generally compatible with reading Parquet files
written by parquet-java, with some caveats.
   
   ### What changes are included in this PR?
   A new testing suite in `parquet/pqarrow/variant_test.go` which uses the test
cases defined in parquet-testing, reads the Parquet files, and compares the
resulting variants against the expected ones.
   
   Some issues were found that I believe are issues with parquet-java and the
test cases rather than with the Go implementation, so discussion is needed on
the following:
   
   * The Parquet test files are missing the Variant logical type annotation.
I've worked around that for testing purposes, but not in a way that can be
merged or that is sustainable. The files need to be regenerated with the
Variant logical type annotation before these tests can be enabled.
   * Several test cases cover variations where the `value` column is missing.
Based on my reading of the
[spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md),
this is an invalid scenario: the spec states that the `typed_value` field may
be omitted when elements are not being shredded as a specific type, but it says
nothing about allowing omission of the `value` field. The Go implementation
currently errors if this field is missing, per my reading of the spec, so those
test cases fail.
   * Test case 43, `testPartiallyShreddedObjectMissingFieldConflict`, seems to
have a conflict between the expected result and the spec. The `b` field exists
within the `value` field while also being a shredded field. The test appears to
assume the data in `value` would be ignored, but the
[spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects)
says that `value` **must never** contain fields represented by the shredded
fields. The desired behavior and result need clarification.
   * Test case 84, `testShreddedObjectWithOptionalFieldStructs`, tests the
scenario where the shredded fields of an object are listed as `optional` in
the schema, but the spec states that they *must* be `required`. The Go
implementation therefore errors on this test, as the spec says this is an
error. Clarification is needed on whether this is a valid test case.
   * Test case 38, `testShreddedObjectMissingTypedValue`, tests the case where
the `typed_value` field is missing. This is allowed by the spec, except that
the spec states that in this scenario the `value` field **must** be `required`.
The test case uses `optional` here, causing the Go implementation to fail.
Clarification is needed.
   * Test case 125, `testPartiallyShreddedObjectFieldConflict`, again tests a
field existing in both `value` and the shredded column, which the spec states
is invalid and will lead to inconsistent results. It is therefore not valid for
this test case to assert a specific result according to the spec, unless the
spec is amended to state that the shredded field takes precedence in this case.
   * One thing that makes the tests a bit difficult is that when we un-shred
back into variants, the current variant code in some libraries will
automatically downcast to the smallest viable precision (downcasting an int32
into an int16, for example, if it fits). This is worked around by testing the
*values* rather than the types, but it is worth mentioning, particularly in the
case of decimal values.
   * A couple of error test cases verify that particular types are **not**
supported, such as UINT32 or FIXED_LEN_BYTE_ARRAY(4). Nothing in the spec,
however, says that an implementation couldn't just upcast a uint32 to an int64
or treat a fixed-length byte array shredded column as a binary column. So is it
meaningful to explicitly error on those cases rather than allow them, since
they are trivially convertible to valid variant types?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
