zeroshade commented on issue #34751: URL: https://github.com/apache/arrow/issues/34751#issuecomment-1493131912
> I think including KV pairs was the answer I was looking for. Does this library support reading and writing arbitrary KV metadata? I don't see any way to do this with parquet readers / writers.

If you look at the [`file.NewParquetWriter`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#NewParquetWriter) function, you can pass an arbitrary number of `WriteOption`s when creating the writer. One of the options is [`WithWriteMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#WithWriteMetadata), which lets you provide the key-value metadata to write for this file. The metadata can be manipulated via [`metadata.KeyValueMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/metadata#KeyValueMetadata).

When reading a file, you can use the [`MetaData`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#Reader.MetaData) method of the reader to retrieve the file-level metadata, and the [`KeyValueMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/metadata#FileMetaData.KeyValueMetadata) method on the `FileMetaData` object returns those key-value pairs from the file.

> Say I have strings and timestamps (uint64) stored in memory. This library only supports timestamps written as int64. To work around this, I was considering writing them as strings using the string logical type. The problem is that a reader (which reads the parquet file back into memory) will not know whether to interpret a string as a string or timestamp because the logical type is the same.

This is because the Parquet specification states that timestamps should be written as an `int64` column with a timestamp logical type. In fact, there is no physical `uint64` type in Parquet; unsigned types are a "logical" type annotated on a column.
That said, it's pretty trivial (and likely more performant than converting them to strings) to just convert your `[]uint64` timestamps into a `[]int64` to write them out, right? But if you really want to convert them to strings, you can use the metadata functions I mentioned above for writing and reading the metadata.

> Say that I want to store a protobuf column which will physically be stored as bytes. A reader will not know how to decode the bytes from the file unless there is some logical type / metadata which indicates the logical type of the column.

Right, this can also be achieved with key-value metadata that is specified at write time and then read back, as long as you communicate ahead of time which key a consumer should be reading.
