zeroshade commented on issue #34751:
URL: https://github.com/apache/arrow/issues/34751#issuecomment-1493131912

   > I think including KV pairs was the answer I was looking for. Does this 
library support reading and writing arbitrary KV metadata? I don't see any way 
to do this with parquet readers / writers.
   
   If you look at the 
[`file.NewParquetWriter`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#NewParquetWriter)
 function, you can add an arbitrary number of `WriteOption`s when creating the 
writer. One of the options is 
[`WithWriteMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#WithWriteMetadata),
 which allows you to provide the key value metadata to write for this file.  
The metadata can be manipulated via 
[`metadata.KeyValueMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/metadata#KeyValueMetadata).
   
   When reading a file, use the 
[`MetaData`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/file#Reader.MetaData)
 method of the reader to retrieve the file-level metadata. The 
[`KeyValueMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/parquet/metadata#FileMetaData.KeyValueMetadata)
 method on the resulting `FileMetaData` object returns the key-value pairs 
stored in the file.
   
   > Say I have strings and timestamps (uint64) stored in memory. This library 
only supports timestamps written as int64. To work around this, I was 
considering writing them as strings using the string logical type. The problem 
is that a reader (which reads the parquet file back into memory) will not know 
whether to interpret a string as a string or timestamp because the logical type 
is the same.
   
   This is because the Parquet specification states that timestamps should be 
written as an `int64` column with a timestamp logical type. In fact, Parquet 
has no physical `uint64` type; unsigned integers are a "logical" type 
annotated on a column. That said, it's pretty trivial (and likely more 
performant) to convert your `[]uint64` timestamps into a `[]int64` before 
writing them than it would be to convert them to strings. But if you 
really want to write them as strings, you can use the metadata functions I 
mentioned above to record, and later read back, which columns hold timestamps.
   
   > Say that I want to store a protobuf column which will physically be stored 
as bytes. A reader will not know how to decode the bytes from the file unless 
there is some logical type / metadata which indicates the logical type of the 
column.
   
   Right, this can also be achieved with key-value metadata: specify it at 
write time and read it back when reading, as long as you agree ahead of time on 
the key that a consumer should look up.

