zeroshade opened a new pull request, #455: URL: https://github.com/apache/arrow-go/pull/455
### Rationale for this change Testing out the variant implementation here against Parquet java using the test cases generated in https://github.com/apache/parquet-testing/pull/90/files. Overall, it confirms that our implementation is generally compatible for reading parquet files written by parquet-java with some caveats. ### What changes are included in this PR? New testing suite in `parquet/pqarrow/variant_test.go` which uses the test cases defined in parquet-testing and attempts to read the parquet files and compares the resulting variants against the expected ones. Some issues were found that I believe are issues with Parquet-java and the test cases rather than issues with the Go implementation, as such discussion is needed for the following: * The parquet test files are missing the Logical Variant Type annotation. Currently I've worked around that for testing purposes, but not in a way that can be merged or that is sustainable. As such the files need to be re-generated with the Variant Logical Type annotation before these tests can be enabled. * Several test cases test variations on situations where the `value` column is missing. Based on my reading of the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) this seems to be an invalid scenario. The specific case is that the spec states the `typed_value` field may be omitted when not shredding elements as a specific type, but says nothing about allowing omission of the `value` field. Currently, the Go implementation will error if this field is missing as per my reading of the spec, meaning those test cases fail. * Test case 43 `testPartiallyShreddedObjectMissingFieldConflict` seems to have a conflict between what is expected and what in the spec. The `b` field exists within the `value` field, while also being a shredded field, the test appears to assume the data in the `value` field would be ignored, but the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects) says that `value` **must never** contain fields represented by the shredded fields. This needs clarification on the desired behavior and result. * Test case 84, `testShreddedObjectWithOptionalFieldStructs` tests the schenario where the shredded fields of an object are listed as `optional` in the schema, but the spec states that they *must* be `required`. Thus, the Go implementation errors on this test as the spec says this is an error. Clarification is needed on if this is a valid test case. * Test case 38 `testShreddedObjectMissingTypedValue` tests the case where the `typed_value` field is missing, this is allowed by the spec except that the spec states that in this scenario the `value` field **must** be `required`. The test case uses `optional` in this scenario causing the Go implementation to fail. Clarification is needed here. * Test case 125, `testPartiallyShreddedObjectFieldConflict` again tests the case of a field existing in both the `value` and the shredded column which the spec states is invalid and will lead to inconsistent results. Thus it is not valid to have this test case assert a specific result according to the spec unless the spec is amended to state that the shredded field takes precedence in this case. * One thing that makes the tests a bit difficult is that when we un-shred back into variants, the current variant code in some libraries will automatically downcast to the smallest viable precision (downcasting an int32 into an int16 for example if it fits). This is worked around by testing the *values* rather than the types, but is worth mentioning. Particularly in the case of decimal values * A couple error test cases verify that particular types are **not** supported such as UINT32 or Fixed Len Byte Array(4), nothing in the spec however says that an implementation couldn't just upcast a uint32 -> int64 or treat a fixed len byte array shredded column as a binary column. So is it meaningful to explicitly error on those cases rather than allow them since they are trivially convertable to valid variant types? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org