zeroshade commented on issue #14798:
URL: https://github.com/apache/arrow/issues/14798#issuecomment-1334019092
There was discussion on Zulip by Ashish Paliwal (I don't know his GitHub
username) about adding Go examples to the Arrow Cookbooks, but I don't believe
anything has come of it yet. Adding more examples to the Parquet documentation
is on my list, but I haven't gotten to it yet; for now, the best examples are
the unit tests.
If you have plain Go slices of data and want to write them to a parquet file,
here is a minimal example you can use until I can update the documentation:
```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v10/parquet"
	"github.com/apache/arrow/go/v10/parquet/file"
	"github.com/apache/arrow/go/v10/parquet/schema"
)

func main() {
	writeData([]int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, nil)
}

func writeData(data []int32, defLevels []int16) {
	// The root node should always be Required according to the spec.
	// Alternately you can create the schema from a struct via
	// schema.NewSchemaFromStruct.
	sc := schema.MustGroup(schema.NewGroupNode("root", parquet.Repetitions.Required, schema.FieldList{
		schema.NewInt32Node("data", parquet.Repetitions.Required, -1),
	}, -1))

	f, err := os.Create("test.parquet")
	if err != nil {
		panic(err)
	}

	// NewParquetWriter also accepts a bunch of WriteOptions.
	writer := file.NewParquetWriter(f, sc)
	defer writer.Close() // writer.Close will automatically call f.Close()

	// Create a row group writer. The default is serialized, so you have to
	// write one column at a time, but it is very memory efficient.
	// Alternately you can use a buffered row group writer to allow writing
	// multiple columns at a time if you are writing in a row-based fashion.
	rgw := writer.AppendRowGroup()
	cwr, err := rgw.NextColumn()
	if err != nil {
		panic(err)
	}
	cw := cwr.(*file.Int32ColumnChunkWriter)

	// len(defLevels) should be equal to len(data) and should contain a 1 for
	// each index that is non-null and a 0 for each index that is null.
	// Alternately, if you pass nil for the defLevels slice it will assume
	// that all of the data is non-null. The last argument is the repetition
	// levels, which are only relevant for nested data.
	cw.WriteBatch(data, defLevels, nil)

	// You can explicitly close the column writer and row group writer, or
	// their Close methods will be called automatically when writer.Close is
	// called.
	cw.Close()
	rgw.Close()
}
```
You can check the documentation here:
https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/schema#NewSchemaFromStruct
for examples of creating schemas from a struct using Go struct tags, if you
prefer that over manually constructing the parquet schema Nodes.
The
[`file`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#pkg-types)
package contains all the types for writing and reading files. You'll notice
there is a corresponding `*ColumnChunkWriter` for each parquet primitive type,
each with a `WriteBatch` method that takes a slice of that type, such as the
`Int32ColumnChunkWriter` in the example above.
To set the properties of the writer, such as the batch size, data page size,
parquet version, data page version, etc., use the writer properties described
here:
https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet#NewWriterProperties
and pass them to the writer via WriteOptions as shown in the documentation
here:
https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#WithWriterProps
Hopefully this is a good enough example to get you started until I can add
better examples to the documentation directly. In the meantime, as mentioned
earlier, the unit tests are probably the best examples for how to read/write
files. You can also use the
[`pqarrow`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/pqarrow)
package if you already have your data in Arrow format, which can write data to
parquet *directly* from Arrow data structures.