zeroshade commented on issue #14798:
URL: https://github.com/apache/arrow/issues/14798#issuecomment-1334019092
There was discussion on Zulip by Ashish Paliwal (I don't know his GitHub
username) about adding Go examples to the Arrow Cookbooks, but I don't believe
anything has come of it yet. Adding more examples to the Parquet documentation
is on my list, but I haven't gotten to it yet; for now, the best examples are
the unit tests.
If you have plain Go slices of data and want to write them to a parquet file,
here is a minimal example you can use until I can update the documentation:
```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v10/parquet"
	"github.com/apache/arrow/go/v10/parquet/file"
	"github.com/apache/arrow/go/v10/parquet/schema"
)

func main() {
	writeData([]int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, nil)
}

func writeData(data []int32, defLevels []int16) {
	// The root node should always be Required according to the spec.
	// Alternately you can create the schema from a struct via
	// schema.NewSchemaFromStruct.
	sc := schema.MustGroup(schema.NewGroupNode("root", parquet.Repetitions.Required, schema.FieldList{
		schema.NewInt32Node("data", parquet.Repetitions.Required, -1),
	}, -1))

	f, err := os.Create("test.parquet")
	if err != nil {
		panic(err)
	}

	// NewParquetWriter also accepts a bunch of WriteOptions.
	writer := file.NewParquetWriter(f, sc)
	defer writer.Close() // writer.Close will automatically call f.Close()

	// Create a row group writer. The default is serialized, so you have to
	// write one column at a time, but it is very memory efficient.
	// Alternately you can use a buffered row group writer to allow writing
	// multiple columns at a time if you are writing in a row-based fashion.
	rgw := writer.AppendRowGroup()
	cwr, err := rgw.NextColumn()
	if err != nil {
		panic(err)
	}
	cw := cwr.(*file.Int32ColumnChunkWriter)

	// len(defLevels) should be equal to len(data) and should contain a 1 for
	// each index that is non-null and a 0 for each index that is null.
	// Alternately, if you pass nil for the defLevels slice it will assume
	// that all of the data is non-null. The last argument is the repetition
	// levels, which are only relevant for nested data.
	cw.WriteBatch(data, defLevels, nil)

	// You can explicitly close the column writer and row group writer, or
	// their Close methods will be called automatically when writer.Close is
	// called.
	cw.Close()
	rgw.Close()
}
```
You can check the documentation here:
https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/schema#NewSchemaFromStruct
for examples of creating schemas from a struct using Go struct tags, if you
prefer that over manually constructing the parquet schema Nodes.
The
[`file`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#pkg-types)
package contains all the types for writing and reading files. You'll notice
there is a corresponding `*ColumnChunkWriter` for each parquet primitive type,
each with a `WriteBatch` method that takes a slice of that type, such as the
`Int32ColumnChunkWriter` in the example above.
To set the properties of the writer, such as the batch size, data page size,
parquet version, data page version, etc., use the writer properties described
here:
https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet#NewWriterProperties
and pass them to the writer via WriteOptions as shown in the documentation
here:
https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#WithWriterProps
Hopefully this is a good enough example to get you started until I can add
better examples to the documentation directly. In the meantime, as mentioned
earlier, the unit tests are probably the best examples for how to read/write
files. You can also use the
[`pqarrow`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/pqarrow)
package if you already have your data in Arrow format, which can write data to
parquet *directly* from Arrow data structures.