tschaub commented on PR #37817:
URL: https://github.com/apache/arrow/pull/37817#issuecomment-1730257997

   @zeroshade (or others) - I haven't been able to get to the bottom of the 
issue mentioned above, and I could use some input.
   
   I might be working with a flawed understanding of things, so I'll describe 
how I think things are supposed to work first.  My assumption is an Arrow 
`array.List` can either be represented in Parquet using a repeated primitive 
field or a group field with a primitive nested in it.
   
   For example, I thought an `array.List` containing `array.Float64` data could 
be written to Parquet using either of the two schema below.
   
   A repeated primitive field:
   ```
   required group root {
     repeated double numbers;
   }
   ```
   
   Or a primitive nested in a group annotated as a list:
   ```
   required group root {
     required group numbers (LIST) {
       repeated group list {
         optional double element;
       }
     }
   }
   ```
   
   The `pqarrow.FileWriter` currently uses the nested group structure when 
creating a Parquet schema.  The changes in this branch make it so the file 
writer conditionally uses a repeated primitive instead - this happens if the 
passed Arrow schema was converted from a Parquet schema that uses a repeated 
primitive.
   
   I'm finding that record writing works when using a Parquet schema with the 
nested group but that it doesn't work when using a schema with a repeated 
primitive.
   
   I tried to write tests to demonstrate the issue based on the latest in 
`main`, but I don't think it can be done using the exported 
`pqarrow.NewFileWriter()` function.  So in an effort to narrow things down, I 
created a similar `pqarrow.NewFileWriterForTesting()` function that accepts 
both a Parquet schema and an Arrow schema (instead of creating the former from 
the latter).
   
   With that new function, this test passes:
   ```go
   func TestRoundTripListOfFloat64(t *testing.T) {
        inputNumbers := []float64{10, 9, 8}
        name := "numbers"
   
        parquetSchema := schema.NewSchema(schema.MustGroup(
                schema.NewGroupNode("root", parquet.Repetitions.Required, 
schema.FieldList{
                        schema.MustGroup(schema.ListOf(
                                schema.NewFloat64Node(name, 
parquet.Repetitions.Optional, -1),
                                parquet.Repetitions.Required,
                                -1,
                        )),
                }, -1),
        ))
   
        outputNumbers := roundTripFloat64s(t, parquetSchema, name, inputNumbers)
        assert.Equal(t, inputNumbers, outputNumbers)
        // this passes
   }
   ```
   
   But this test fails:
   ```go
   func TestRoundTripRepeatedFloat64(t *testing.T) {
        inputNumbers := []float64{10, 9, 8}
        name := "numbers"
   
        parquetSchema := schema.NewSchema(schema.MustGroup(
                schema.NewGroupNode("root", parquet.Repetitions.Required, 
schema.FieldList{
                        schema.NewFloat64Node(name, 
parquet.Repetitions.Repeated, -1),
                }, -1),
        ))
   
        outputNumbers := roundTripFloat64s(t, parquetSchema, name, inputNumbers)
        assert.Equal(t, inputNumbers, outputNumbers)
        // this fails
        // expected: []float64{10, 9, 8}
        // actual  : []float64(nil)
   }
   ```
   
   See 
https://github.com/apache/arrow/compare/main...tschaub:arrow:write-repeated for 
the `roundTripFloat64s` function and the `pqarray.NewFileWriterForTesting`.  
(Those changes are based on the latest in the `main` branch, not the changes in 
this branch).
   
   I've stepped through the Arrow column writer writing code and can confirm 
that the `defLevels` are different in the two tests above (`[3, 3, 3]` for the 
passing case vs `[2, 2, 2]` for the failing case), but I can't make sense of 
why the values are not getting written in the case of the schema with the 
repeated primitive.
   
   I know this is an unsolicited change, and a lot to dig into, but I 
appreciate any thoughts that others have.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to