seongkim0228 opened a new issue, #628:
URL: https://github.com/apache/arrow-go/issues/628
### Describe the bug, including details regarding any error messages, version, and platform.
When reading a subset of columns from a struct field (partial column
projection), the getReader function in pqarrow/file_reader.go fails to filter
childFields in sync with childReaders, causing a length mismatch and potential
panic.
In file_reader.go lines 594-604, only childReaders is pruned to remove nil
entries, but childFields retains zero-valued arrow.Field{} entries for skipped
children:
```
childReaders = slices.DeleteFunc(childReaders,
	func(r *ColumnReader) bool { return r == nil })
// childFields is NOT filtered here!
```
This causes childFields and childReaders to have mismatched lengths when
passed to newStructReader.
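The mismatch can be illustrated without the Arrow packages. The following is a minimal sketch, assuming stand-in types: `field`, `reader`, and `pruneReadersOnly` are hypothetical names invented for this illustration and are not the real `arrow.Field`, `*ColumnReader`, or any function in pqarrow. It only shows how pruning one of two parallel slices leaves their lengths out of step.

```go
package main

import (
	"fmt"
	"slices"
)

// field and reader are hypothetical stand-ins for arrow.Field and
// *ColumnReader. They are NOT the real pqarrow types.
type field struct {
	Name string
	Type string // empty string models a zero-valued (skipped) child
}

type reader struct{ name string }

// pruneReadersOnly mirrors the pattern described in the report: nil readers
// are dropped, but the parallel fields slice is returned unchanged.
func pruneReadersOnly(fields []field, readers []*reader) ([]field, []*reader) {
	readers = slices.DeleteFunc(readers, func(r *reader) bool { return r == nil })
	return fields, readers
}

func main() {
	// Two struct children; only "a" was selected, so slot 1 stays zero-valued.
	fields := []field{{Name: "a", Type: "float64"}, {}}
	readers := []*reader{{name: "a"}, nil}

	f, r := pruneReadersOnly(fields, readers)
	// The two lengths now differ, which is the condition the report describes.
	fmt.Println("fields:", len(f), "readers:", len(r))
}
```

Any consumer that indexes both slices by the same position, or assumes they have equal length, would misbehave on such input.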
Reproduction
```
package main

import (
	"bytes"
	"context"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "nested", Type: arrow.StructOf(
			arrow.Field{Name: "a", Type: arrow.PrimitiveTypes.Float64},
			arrow.Field{Name: "b", Type: arrow.PrimitiveTypes.Float64},
		)},
	}, nil)

	// Write a parquet file
	buf := new(bytes.Buffer)
	writer, _ := pqarrow.NewFileWriter(schema, buf, nil, pqarrow.DefaultWriterProps())
	b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	sb := b.Field(0).(*array.StructBuilder)
	sb.Append(true)
	sb.FieldBuilder(0).(*array.Float64Builder).Append(1.0)
	sb.FieldBuilder(1).(*array.Float64Builder).Append(2.0)
	writer.Write(b.NewRecord())
	writer.Close()

	// Read with partial column selection
	pf, _ := file.NewParquetReader(bytes.NewReader(buf.Bytes()))
	fr, _ := pqarrow.NewFileReader(pf, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)

	// Only read nested.a (leaf index 0), not nested.b (leaf index 1)
	partialLeaves := map[int]bool{0: true}
	fieldIdx, _ := fr.Manifest.GetFieldIndices([]int{0})

	// This panics due to childFields/childReaders length mismatch
	fr.GetFieldReader(context.Background(), fieldIdx[0], partialLeaves, []int{0})
}
```
Expected Behavior
Partial struct column reads should work correctly, returning a reader for
only the selected fields.
Actual Behavior
Panic or undefined behavior due to mismatched slice lengths.
Suggested Fix
Filter childFields alongside childReaders:
```
childReaders = slices.DeleteFunc(childReaders,
	func(r *ColumnReader) bool { return r == nil })
childFields = slices.DeleteFunc(childFields,
	func(f arrow.Field) bool { return f.Type == nil })
if len(childReaders) == 0 {
	return nil, nil
}
```
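An alternative shape for the same idea is a single pass that decides per position, using the nil reader as the only signal, so the two slices cannot drift apart even if a skipped field were ever not zero-valued. This is only a sketch under assumptions: `field`, `reader`, and `compactChildren` are hypothetical stand-ins invented here, not pqarrow types or functions, and it assumes both input slices have the same length.

```go
package main

import "fmt"

// field and reader are hypothetical stand-ins for arrow.Field and
// *ColumnReader. They are NOT the real pqarrow types.
type field struct {
	Name string
	Type string
}

type reader struct{ name string }

// compactChildren (hypothetical helper) keeps only the positions whose reader
// is non-nil. It assumes len(fields) == len(readers) on entry, and the two
// returned slices always have equal length.
func compactChildren(fields []field, readers []*reader) ([]field, []*reader) {
	keptFields := make([]field, 0, len(fields))
	keptReaders := make([]*reader, 0, len(readers))
	for i, r := range readers {
		if r == nil {
			continue // child was not selected; drop its field as well
		}
		keptFields = append(keptFields, fields[i])
		keptReaders = append(keptReaders, r)
	}
	return keptFields, keptReaders
}

func main() {
	fields := []field{{Name: "a", Type: "float64"}, {}}
	readers := []*reader{{name: "a"}, nil}

	f, r := compactChildren(fields, readers)
	// Both slices are filtered together, so their lengths match.
	fmt.Println("fields:", len(f), "readers:", len(r), "kept:", f[0].Name)
}
```

Whether this or the two `slices.DeleteFunc` calls above fits the surrounding code better is a judgment call for the maintainers; the point is only that both slices are filtered by the same criterion.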
### Component(s)
Parquet