tschaub commented on PR #37817:
URL: https://github.com/apache/arrow/pull/37817#issuecomment-1730257997
@zeroshade (or others) - I haven't been able to get to the bottom of the
issue mentioned above, and I could use some input.
I might be working with a flawed understanding of things, so I'll describe
how I think things are supposed to work first. My assumption is an Arrow
`array.List` can either be represented in Parquet using a repeated primitive
field or a group field with a primitive nested in it.
For example, I thought an `array.List` containing `array.Float64` data could
be written to Parquet using either of the two schema below.
A repeated primitive field:
```
required group root {
repeated double numbers;
}
```
Or a primitive nested in a group annotated as a list:
```
required group root {
required group numbers (LIST) {
repeated group list {
optional double element;
}
}
}
```
The `pqarrow.FileWriter` currently uses the nested group structure when
creating a Parquet schema. The changes in this branch make it so the file
writer conditionally uses a repeated primitive instead - this happens if the
passed Arrow schema was converted from a Parquet schema that uses a repeated
primitive.
I'm finding that record writing works when using a Parquet schema with the
nested group but that it doesn't work when using a schema with a repeated
primitive.
I tried to write tests to demonstrate the issue based on the latest in
`main`, but I don't think it can be done using the exported
`pqarrow.NewFileWriter()` function. So in an effort to narrow things down, I
created a similar `pqarrow.NewFileWriterForTesting()` function that accepts
both a Parquet schema and an Arrow schema (instead of creating the former from
the latter).
With that new function, this test passes:
```go
func TestRoundTripListOfFloat64(t *testing.T) {
inputNumbers := []float64{10, 9, 8}
name := "numbers"
parquetSchema := schema.NewSchema(schema.MustGroup(
schema.NewGroupNode("root", parquet.Repetitions.Required,
schema.FieldList{
schema.MustGroup(schema.ListOf(
schema.NewFloat64Node(name,
parquet.Repetitions.Optional, -1),
parquet.Repetitions.Required,
-1,
)),
}, -1),
))
outputNumbers := roundTripFloat64s(t, parquetSchema, name, inputNumbers)
assert.Equal(t, inputNumbers, outputNumbers)
// this passes
}
```
But this test fails:
```go
func TestRoundTripRepeatedFloat64(t *testing.T) {
inputNumbers := []float64{10, 9, 8}
name := "numbers"
parquetSchema := schema.NewSchema(schema.MustGroup(
schema.NewGroupNode("root", parquet.Repetitions.Required,
schema.FieldList{
schema.NewFloat64Node(name,
parquet.Repetitions.Repeated, -1),
}, -1),
))
outputNumbers := roundTripFloat64s(t, parquetSchema, name, inputNumbers)
assert.Equal(t, inputNumbers, outputNumbers)
// this fails
// expected: []float64{10, 9, 8}
// actual : []float64(nil)
}
```
See
https://github.com/apache/arrow/compare/main...tschaub:arrow:write-repeated for
the `roundTripFloat64s` function and the `pqarray.NewFileWriterForTesting`.
(Those changes are based on the latest in the `main` branch, not the changes in
this branch).
I've stepped through the Arrow column writer writing code and can confirm
that the `defLevels` are different in the two tests above (`[3, 3, 3]` for the
passing case vs `[2, 2, 2]` for the failing case), but I can't make sense of
why the values are not getting written in the case of the schema with the
repeated primitive.
I know this is an unsolicited change, and a lot to dig into, but I
appreciate any thoughts that others have.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]