seongkim0228 opened a new issue, #628:
URL: https://github.com/apache/arrow-go/issues/628

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When reading a subset of columns from a struct field (partial column 
projection), the getReader function in pqarrow/file_reader.go fails to filter 
childFields in sync with childReaders, causing a length mismatch and potential 
panic.
   
   In file_reader.go lines 594-604, only childReaders is pruned to remove nil 
entries, but childFields retains zero-valued arrow.Field{} entries for skipped 
children:
   ```
    childReaders = slices.DeleteFunc(childReaders,
        func(r *ColumnReader) bool { return r == nil })
    // childFields is NOT filtered here!
   ```
   This causes childFields and childReaders to have mismatched lengths when 
passed to newStructReader.
   
   ### Reproduction
   ```
   package main
   
   import (
        "bytes"
        "context"
   
        "github.com/apache/arrow-go/v18/arrow"
        "github.com/apache/arrow-go/v18/arrow/array"
        "github.com/apache/arrow-go/v18/arrow/memory"
        "github.com/apache/arrow-go/v18/parquet/file"
        "github.com/apache/arrow-go/v18/parquet/pqarrow"
   )
   
   func main() {
        schema := arrow.NewSchema([]arrow.Field{
                {Name: "nested", Type: arrow.StructOf(
                    arrow.Field{Name: "a", Type: arrow.PrimitiveTypes.Float64},
                    arrow.Field{Name: "b", Type: arrow.PrimitiveTypes.Float64},
                )},
        }, nil)
   
        // Write a parquet file
        buf := new(bytes.Buffer)
        writer, _ := pqarrow.NewFileWriter(schema, buf, nil, pqarrow.DefaultWriterProps())
        b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
        sb := b.Field(0).(*array.StructBuilder)
        sb.Append(true)
        sb.FieldBuilder(0).(*array.Float64Builder).Append(1.0)
        sb.FieldBuilder(1).(*array.Float64Builder).Append(2.0)
        writer.Write(b.NewRecord())
        writer.Close()
   
        // Read with partial column selection
        pf, _ := file.NewParquetReader(bytes.NewReader(buf.Bytes()))
        fr, _ := pqarrow.NewFileReader(pf, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
   
        // Only read nested.a (leaf index 0), not nested.b (leaf index 1)
        partialLeaves := map[int]bool{0: true}
        fieldIdx, _ := fr.Manifest.GetFieldIndices([]int{0})
   
        // This panics due to childFields/childReaders length mismatch
        fr.GetFieldReader(context.Background(), fieldIdx[0], partialLeaves, []int{0})
   }
   ```
   ### Expected Behavior
   Partial struct column reads should work correctly, returning a reader for 
only the selected fields.
   
   ### Actual Behavior
   Panic or undefined behavior due to mismatched slice lengths.
   
   ### Suggested Fix
   Filter childFields alongside childReaders:
   ```
   childReaders = slices.DeleteFunc(childReaders,
       func(r *ColumnReader) bool { return r == nil })
   childFields = slices.DeleteFunc(childFields,
       func(f arrow.Field) bool { return f.Type == nil })
   
   if len(childReaders) == 0 {
       return nil, nil
   }
   ```
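   The pattern behind the fix can be shown in isolation. In this hypothetical sketch, `reader` and `field` again stand in for `*ColumnReader` and `arrow.Field` (the empty `typ` string plays the role of a nil `Type`); pruning both slices with predicates that agree on which children were skipped keeps them aligned:

   ```go
   package main

   import (
   	"fmt"
   	"slices"
   )

   // Hypothetical stand-ins for *ColumnReader and arrow.Field; an
   // empty typ marks a skipped child, just as a nil Type would.
   type reader struct{ name string }
   type field struct{ typ string }

   func main() {
   	readers := []*reader{{name: "a"}, nil}
   	fields := []field{{typ: "float64"}, {}}

   	// Prune both parallel slices with matching predicates so they
   	// end up the same length.
   	readers = slices.DeleteFunc(readers, func(r *reader) bool { return r == nil })
   	fields = slices.DeleteFunc(fields, func(f field) bool { return f.typ == "" })

   	fmt.Println(len(readers), len(fields)) // 1 1 -- back in sync
   }
   ```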
   
   
   ### Component(s)
   
   Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
