joellubi commented on issue #43640: URL: https://github.com/apache/arrow/issues/43640#issuecomment-2289802664
Hi @datbth, sorry about the delay on this. I've been thinking about a high-level approach that would help address this issue as well as #43624, so I incorporated a potential solution into #43679. Let me know what you think.

The relevant change is the introduction of the `CustomParquetType` interface, which allows Arrow `ExtensionType`s to control the Parquet `LogicalType` they are written as. It should work out of the box for `JSON`, as in your example, by just constructing a `JSONType` extension array and sending it to the Parquet writer. See the [test](https://github.com/apache/arrow/pull/43679/files#diff-844f7208e9f32de0a40594bc0c56fe247dfd4d7f51a9ba917e65a4aedb35ba74R2148) for example usage.

This can also be used to write Parquet `Interval` data, using the following approach:

- Create a new Arrow extension type with a storage type compatible with the Parquet target. For example, it could be called `MonthDayMillis` with storage `FixedSizeBinary<12>`. This doesn't need to be upstreamed; you can maintain the type in your own codebase.
- Add the method:

  ```go
  func (t *myType) ParquetLogicalType() schema.LogicalType {
      return schema.IntervalLogicalType{}
  }
  ```

- Now there is zero casting or conversion of the underlying data necessary, because you're writing millis instead of nanos in the first place, so the writer can efficiently write the underlying storage and update the `LogicalType`.
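For reference, the Parquet `Interval` logical type annotates a `fixed_len_byte_array(12)` holding three little-endian unsigned 32-bit integers: months, days, and milliseconds. A minimal, self-contained sketch of producing that 12-byte storage (the `packInterval` helper name is illustrative, not part of the arrow or parquet libraries) might look like:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// packInterval encodes a (months, days, millis) triple into the 12-byte
// little-endian layout that Parquet's INTERVAL logical type expects for
// fixed_len_byte_array(12) storage. Values produced this way could back a
// hypothetical MonthDayMillis extension array with FixedSizeBinary<12> storage.
func packInterval(months, days, millis uint32) [12]byte {
	var buf [12]byte
	binary.LittleEndian.PutUint32(buf[0:4], months)
	binary.LittleEndian.PutUint32(buf[4:8], days)
	binary.LittleEndian.PutUint32(buf[8:12], millis)
	return buf
}

func main() {
	b := packInterval(1, 2, 3)
	fmt.Println(b) // [1 0 0 0 2 0 0 0 3 0 0 0]
}
```

Because the storage bytes are already in the target layout, the writer only needs to attach the `IntervalLogicalType` annotation; no per-value conversion happens at write time.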
