ycyang-26 commented on issue #14798:
URL: https://github.com/apache/arrow/issues/14798#issuecomment-1334680024
> There was discussion on the Zulip by Ashish Paliwal (I don't know his GH username) about adding Go examples to the Arrow Cookbooks, but I don't believe anything has come of it yet. It's on my list to add more examples to the Parquet documentation, but I haven't gotten around to it yet. Currently the best examples are the unit tests.
>
> If you have plain Go slices of data and want to write them to a parquet file, here is a minimal example you can use for now until I can update the documentation:
>
> ```go
> package main
>
> import (
> 	"os"
>
> 	"github.com/apache/arrow/go/v10/parquet"
> 	"github.com/apache/arrow/go/v10/parquet/file"
> 	"github.com/apache/arrow/go/v10/parquet/schema"
> )
>
> func main() {
> 	writeData([]int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, nil)
> }
>
> func writeData(data []int32, defLevels []int16) {
> 	// the root node should always be Required according to the spec
> 	sc := schema.MustGroup(schema.NewGroupNode("root", parquet.Repetitions.Required, schema.FieldList{
> 		schema.NewInt32Node("data", parquet.Repetitions.Required, -1),
> 	}, -1))
> 	// alternately, you can create the schema from a struct via schema.NewSchemaFromStruct
>
> 	f, err := os.Create("test.parquet")
> 	if err != nil {
> 		panic(err)
> 	}
>
> 	// NewParquetWriter also accepts a bunch of WriteOptions
> 	writer := file.NewParquetWriter(f, sc)
> 	defer writer.Close() // writer.Close will automatically call f.Close()
>
> 	// create a row group writer; the default is serialized, so you have to
> 	// write one column at a time, but it is very memory efficient. Alternately
> 	// you can use a buffered row group writer to allow writing multiple columns
> 	// at a time if you are writing in a row-based fashion
> 	rgw := writer.AppendRowGroup()
>
> 	cwr, err := rgw.NextColumn()
> 	if err != nil {
> 		panic(err)
> 	}
> 	cw := cwr.(*file.Int32ColumnChunkWriter)
>
> 	// len(defLevels) should be equal to len(data) and should contain a 1 for
> 	// each index that is non-null and a 0 for each index that is null.
> 	// Alternately, if you pass nil for the defLevels slice it will assume that
> 	// all of the data is non-null.
> 	// the last argument is repetition levels, which are only relevant for nested data
> 	cw.WriteBatch(data, defLevels, nil)
>
> 	// you can explicitly close the column writer and row group writer, or they
> 	// will automatically have their Close methods called when writer.Close is called
> 	cw.Close()
> 	rgw.Close()
> }
> ```
>
> You can check the documentation here: https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/schema#NewSchemaFromStruct for examples of creating schemas from a Go struct and struct tags, if you prefer that over manually constructing the parquet schema nodes.
>
> The [`file`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#pkg-types) package contains all the types for writing and reading files. You'll notice there is a corresponding `*ColumnChunkWriter` for each parquet primitive type, each with a `WriteBatch` method that takes a slice of that type, such as the `Int32ColumnChunkWriter` in the example above.
>
> To manipulate the properties of the writer, such as batch size, data page size, parquet version, data page version, etc., use the writer properties described here: https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet#NewWriterProperties and pass them to the writer via WriteOptions as shown in the documentation here: https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#WithWriterProps
>
> Hopefully this is a good enough example to get you started until I can add better examples to the documentation directly. In the meantime, as mentioned earlier, the unit tests are probably the best examples for how to read/write files. You can also use the [`pqarrow`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/pqarrow) package if you already have your data in Arrow format, which can write data to parquet _directly_ from Arrow data structures.

Thanks for the sample! It's very helpful. Hope to see the documentation soon.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]