ycyang-26 commented on issue #14798:
URL: https://github.com/apache/arrow/issues/14798#issuecomment-1334680024
> There was discussion on the Zulip by Ashish Paliwal (I don't know his GH username) about adding Go examples to the Arrow Cookbooks, but I don't believe anything has come of it yet. It's on my list to add more examples to the Parquet documentation, but I haven't gotten around to it yet. Currently the best examples are the unit tests.
>
> If you have plain Go slices of data and want to write them to a parquet file, here is a minimal example you can use for now until I can update the documentation:
>
> ```go
> package main
>
> import (
> 	"os"
>
> 	"github.com/apache/arrow/go/v10/parquet"
> 	"github.com/apache/arrow/go/v10/parquet/file"
> 	"github.com/apache/arrow/go/v10/parquet/schema"
> )
>
> func main() {
> 	writeData([]int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, nil)
> }
>
> func writeData(data []int32, defLevels []int16) {
> 	// the root node should always be Required according to the spec
> 	sc := schema.MustGroup(schema.NewGroupNode("root", parquet.Repetitions.Required, schema.FieldList{
> 		schema.NewInt32Node("data", parquet.Repetitions.Required, -1),
> 	}, -1))
> 	// alternately, you can create the schema from a struct via schema.NewSchemaFromStruct
>
> 	f, err := os.Create("test.parquet")
> 	if err != nil {
> 		panic(err)
> 	}
>
> 	// NewParquetWriter also accepts a bunch of WriteOptions
> 	writer := file.NewParquetWriter(f, sc)
> 	defer writer.Close() // writer.Close will automatically call f.Close()
>
> 	// create a row group writer; the default is serialized, so you have to
> 	// write one column at a time, but it is very memory efficient. Alternately
> 	// you can use a buffered row group writer to allow writing multiple columns
> 	// at a time if you are writing in a row-based fashion
> 	rgw := writer.AppendRowGroup()
>
> 	cwr, err := rgw.NextColumn()
> 	if err != nil {
> 		panic(err)
> 	}
> 	cw := cwr.(*file.Int32ColumnChunkWriter)
>
> 	// len(defLevels) should be equal to len(data) and should contain a 1 for
> 	// each index that is non-null and a 0 for each index that is null.
> 	// Alternately, if you pass nil for the defLevels slice it will assume that
> 	// all of the data is non-null.
> 	// the last argument is repetition levels, which are only relevant for nested data
> 	cw.WriteBatch(data, defLevels, nil)
>
> 	// you can explicitly close the column writer and row group writer, or they
> 	// will automatically have their Close methods called when writer.Close is called
> 	cw.Close()
> 	rgw.Close()
> }
> ```
>
> You can check the documentation here: https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/schema#NewSchemaFromStruct for examples of creating schemas from a Go struct and struct tags, if you prefer that over manually constructing the parquet schema nodes.
>
> The [`file`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#pkg-types) package contains all the types for writing and reading files. You'll notice there is a corresponding `*ColumnChunkWriter` for each parquet primitive type, each with a `WriteBatch` method that takes a slice of that type, such as the `Int32ColumnChunkWriter` in the example above.
>
> To manipulate the properties of the writer, such as batch size, data page size, parquet version, data page version, etc., use the writer properties described here: https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet#NewWriterProperties and pass them to the writer via WriteOptions as shown in the documentation here: https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#WithWriterProps
>
> Hopefully this is a good enough example to get you started until I can add better examples to the documentation directly. In the meantime, as mentioned earlier, the unit tests are probably the best examples for how to read/write files. You can also use the [`pqarrow`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/pqarrow) package if you already have your data in Arrow format, which can write data to parquet _directly_ from Arrow data structures.

Thanks for the sample! It's very helpful. Hope to see the documentation soon.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]