aihuaxu commented on PR #455: URL: https://github.com/apache/arrow-go/pull/455#issuecomment-3137794519
> Several test cases test variations on situations where the value column is missing. Based on my reading of the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) this seems to be an invalid scenario. The specific case is that the spec states the typed_value field may be omitted when not shredding elements as a specific type, but says nothing about allowing omission of the value field. Currently, the Go implementation will error if this field is missing as per my reading of the spec, meaning those test cases fail. Both `value` and `typed_value` are optional per spec and `value` can be missing as I understand. <img width="1091" height="251" alt="image" src="https://github.com/user-attachments/assets/32432ac1-edea-477b-81c0-967dd9304115" /> > Test case 43 testPartiallyShreddedObjectMissingFieldConflict seems to have a conflict between what is expected and what in the spec. The b field exists within the value field, while also being a shredded field, the test appears to assume the data in the value field would be ignored, but the [spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects) says that value must never contain fields represented by the shredded fields. This needs clarification on the desired behavior and result. `The value column of a partially shredded object must never contain fields represented by the Parquet columns in typed_value (shredded fields). Readers may always assume that data is written correctly and that shredded fields in typed_value are not present in value.` This test case is to prove that the reader will only read from `typed_value` and ignore the one from `value`. That means, the reader is not responsible to validate the duplicate key and the reader will read from `typed_value`. > Test case 84, testShreddedObjectWithOptionalFieldStructs tests the schenario where the shredded fields of an object are listed as optional in the schema, but the spec states that they must be required. Thus, the Go implementation errors on this test as the spec says this is an error. Clarification is needed on if this is a valid test case. This seems to be reasonable to error out for me to enforce it's required in the schema. ``` required group email { optional binary value; optional binary typed_value (STRING); } ``` > Test case 38 testShreddedObjectMissingTypedValue tests the case where the typed_value field is missing, this is allowed by the spec except that the spec states that in this scenario the value field must be required. The test case uses optional in this scenario causing the Go implementation to fail. Clarification is needed here. We will generate the schema first which will have both `value` and `typed_value` optional. But a value is to be shredded, the `value` column may be required. Do we fail in GO that `value` schema is optional? > Test case 125, testPartiallyShreddedObjectFieldConflict again tests the case of a field existing in both the value and the shredded column which the spec states is invalid and will lead to inconsistent results. Thus it is not valid to have this test case assert a specific result according to the spec unless the spec is amended to state that the shredded field takes precedence in this case. This is same as test case 43. My understanding is that if writer writes wrong data, the reader may only read the `typed_value`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org