zeroshade commented on issue #14924:
URL: https://github.com/apache/arrow/issues/14924#issuecomment-1349091344
@tuziershi The Go parquet library *does* support repeated fields! You have
two options when reading them:
First, you can read the repetition and definition levels directly, like so:
```go
import (
	// ...
	"github.com/apache/arrow/go/v10/parquet/file"
	// ...
)

func readFile(fname string) {
	rdr, err := file.OpenParquetFile(fname, false /* use_mmap */)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	var (
		values = make([]int32, 1024)
		// slice to hold definition levels
		defLevels = make([]int16, 1024)
		// slice to hold repetition levels
		repLevels = make([]int16, 1024)
	)

	for i := 0; i < rdr.NumRowGroups(); i++ {
		rowGroupRdr := rdr.RowGroup(i)
		colRdr, err := rowGroupRdr.Column(0)
		if err != nil {
			panic(err)
		}

		// I'm going to assume the first column is int32, but you could use a
		// type switch or otherwise to get the correctly typed ColumnChunkReader
		cr := colRdr.(*file.Int32ColumnChunkReader)
		// a single call reads at most 1024 levels; call ReadBatch again
		// if the column chunk holds more than that
		totalRows, physicalValuesRead, err := cr.ReadBatch(1024, values, defLevels, repLevels)
		if err != nil {
			panic(err)
		}

		// defLevels and repLevels will be aligned from indices 0 -> totalRows
		// values will be populated from 0 -> physicalValuesRead, nulls are excluded
		// you can process the defLevels and repLevels to work out repeated lists etc.
		_, _ = totalRows, physicalValuesRead // placeholder so the snippet compiles
	}
}
```
You can use the links that @drin commented with to learn how to process /
handle the repetition levels and definition levels to work with the data.
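
To make the level processing a bit more concrete, here is a minimal sketch of rebuilding rows from the three slices filled by `ReadBatch`. It assumes the simplest case, a column declared as a bare `repeated int32` (so the maximum definition level and maximum repetition level are both 1). `assembleLists` is a hypothetical helper of mine, not something the parquet package provides, and deeper nesting or optional lists would need more level values handled:

```go
package main

import "fmt"

// assembleLists rebuilds one []int32 per row from the levels of a column
// assumed to be a bare `repeated int32` (max definition level 1, max
// repetition level 1). Hypothetical helper, not part of the parquet package.
func assembleLists(values []int32, defLevels, repLevels []int16) [][]int32 {
	var rows [][]int32
	valIdx := 0
	for i := range defLevels {
		// repetition level 0 marks the start of a new row; the len check
		// guards against a batch that begins in the middle of a row
		if repLevels[i] == 0 || len(rows) == 0 {
			rows = append(rows, []int32{})
		}
		// definition level 1 means a value is physically present;
		// definition level 0 here means the row's list is empty
		if defLevels[i] == 1 {
			last := len(rows) - 1
			rows[last] = append(rows[last], values[valIdx])
			valIdx++
		}
	}
	return rows
}

func main() {
	// Levels encoding the rows [1, 2], [], [3]
	values := []int32{1, 2, 3}
	defLevels := []int16{1, 1, 0, 1}
	repLevels := []int16{0, 1, 0, 0}
	fmt.Println(assembleLists(values, defLevels, repLevels)) // prints [[1 2] [] [3]]
}
```

With real data you would pass `values[:physicalValuesRead]`, `defLevels[:totalRows]` and `repLevels[:totalRows]`, and keep in mind that a row can straddle two batches.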
Alternatively, you can use the `pqarrow` package to read from Parquet directly
into an Arrow nested column:
```go
import (
	// ...
	"context"

	"github.com/apache/arrow/go/v10/arrow/memory"
	"github.com/apache/arrow/go/v10/parquet/file"
	"github.com/apache/arrow/go/v10/parquet/pqarrow"
	// ...
)

func readFile(fname string) {
	rdr, err := file.OpenParquetFile(fname, false /* use_mmap */)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	arrRdr, err := pqarrow.NewFileReader(rdr, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
	if err != nil {
		panic(err)
	}

	// there are various ways you could read the data if you like
	// (https://pkg.go.dev/github.com/apache/arrow/go/v10/parquet/pqarrow#FileReader)
	// You can fetch a single column, multiple columns, record batches, or
	// read the whole file as an arrow.Table
	tbl, err := arrRdr.ReadTable(context.Background())
	if err != nil {
		panic(err)
	}
	defer tbl.Release()

	// tbl is an arrow.Table which will contain the data, and repeated
	// columns will translate to array.List arrow Arrays
}
```
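
Once you have a list array, each row is just a slice of a flat child array, delimited by an offsets buffer. The sketch below shows that layout with plain slices so it runs without the Arrow module. `listsFromOffsets` is a hypothetical helper, and it assumes you have already pulled the offsets (for example via `Offsets()` on the `*array.List`) and the child values (via `ListValues()`) out of a column chunk. Null rows are not handled here:

```go
package main

import "fmt"

// listsFromOffsets slices a flat child-values buffer into one list per row,
// following the Arrow list layout: row i spans values[offsets[i]:offsets[i+1]].
// Hypothetical helper for illustration; null rows are not handled.
func listsFromOffsets(offsets []int32, values []int32) [][]int32 {
	if len(offsets) == 0 {
		return nil
	}
	rows := make([][]int32, 0, len(offsets)-1)
	for i := 0; i+1 < len(offsets); i++ {
		rows = append(rows, values[offsets[i]:offsets[i+1]])
	}
	return rows
}

func main() {
	// Offsets and child values for the rows [1, 2], [], [3]
	offsets := []int32{0, 2, 2, 3}
	values := []int32{1, 2, 3}
	fmt.Println(listsFromOffsets(offsets, values)) // prints [[1 2] [] [3]]
}
```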
I hope this helps!