joellubi commented on issue #43640:
URL: https://github.com/apache/arrow/issues/43640#issuecomment-2289802664

   Hi @datbth, sorry about the delay on this. I've been thinking about a 
high-level approach that would help address this issue as well as #43624, so I 
incorporated a potential solution into #43679.
   
   Let me know what you think. The relevant change is the introduction of the 
`CustomParquetType` interface which allows Arrow `ExtensionTypes` to control 
the Parquet `LogicalType` they are written as. It should work out of the box 
for `JSON` as in your example by just constructing a `JSONType` extension array 
and sending it to the parquet writer. See the 
[test](https://github.com/apache/arrow/pull/43679/files#diff-844f7208e9f32de0a40594bc0c56fe247dfd4d7f51a9ba917e65a4aedb35ba74R2148)
 for example usage.
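   To make the "out of the box" path concrete, here is a rough sketch of writing a JSON extension column with the Go Parquet writer. This is only illustrative: it assumes the `JSONType` and writer hook from #43679 (unmerged at the time of writing), so the `extensions.NewJSONType` constructor, the module path, and exact signatures may differ from what ships.

   ```go
   package main

   import (
   	"os"

   	"github.com/apache/arrow/go/v18/arrow"
   	"github.com/apache/arrow/go/v18/arrow/array"
   	"github.com/apache/arrow/go/v18/arrow/extensions"
   	"github.com/apache/arrow/go/v18/arrow/memory"
   	"github.com/apache/arrow/go/v18/parquet/pqarrow"
   )

   func main() {
   	mem := memory.DefaultAllocator

   	// JSON documents are stored as plain strings; the extension type wraps them.
   	sb := array.NewStringBuilder(mem)
   	defer sb.Release()
   	sb.AppendValues([]string{`{"a":1}`, `{"b":[2,3]}`}, nil)
   	storage := sb.NewStringArray()
   	defer storage.Release()

   	// Assumed constructor from the PR; the storage type must be string-like.
   	jsonType, err := extensions.NewJSONType(arrow.BinaryTypes.String)
   	if err != nil {
   		panic(err)
   	}
   	col := array.NewExtensionArrayWithStorage(jsonType, storage)
   	defer col.Release()

   	sc := arrow.NewSchema([]arrow.Field{{Name: "doc", Type: jsonType}}, nil)
   	rec := array.NewRecord(sc, []arrow.Array{col}, int64(col.Len()))
   	defer rec.Release()
   	tbl := array.NewTableFromRecords(sc, []arrow.Record{rec})
   	defer tbl.Release()

   	f, err := os.Create("json_col.parquet")
   	if err != nil {
   		panic(err)
   	}
   	defer f.Close()

   	// With the CustomParquetType hook, the writer should annotate the
   	// column with the Parquet JSON LogicalType rather than plain UTF8.
   	if err := pqarrow.WriteTable(tbl, f, 1024, nil, pqarrow.DefaultWriterProps()); err != nil {
   		panic(err)
   	}
   }
   ```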
   
   This approach can also be used to write Parquet `Interval` data, using the 
following steps:
   - Create a new Arrow extension type with a storage type compatible with the 
Parquet target.
   - For example, it could be called `MonthDayMillis` with storage 
`FixedSizeBinary<12>`. This doesn't need to be upstreamed; you can maintain the 
type in your own codebase.
   - Add the method:
   ```go
   func (t *myType) ParquetLogicalType() schema.LogicalType {
       return schema.IntervalLogicalType{}
   }
   ```
   - Now no casting or conversion of the underlying data is necessary, because 
you're writing millis instead of nanos in the first place, so the writer can 
efficiently write the underlying storage and simply update the `LogicalType`.
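   For reference, Parquet's `Interval` logical type annotates a `FIXED_LEN_BYTE_ARRAY(12)` whose bytes are three little-endian unsigned 32-bit integers: months, days, and milliseconds (per the Parquet format spec). A hypothetical `MonthDayMillis` extension type would pack its values into that layout; a minimal, self-contained sketch of the packing (the `packMonthDayMillis` helper name is made up for illustration):

   ```go
   package main

   import (
   	"encoding/binary"
   	"fmt"
   )

   // packMonthDayMillis encodes an interval into Parquet's 12-byte INTERVAL
   // layout: three little-endian uint32 values (months, days, milliseconds).
   // This is the byte layout a FixedSizeBinary<12> storage array would hold.
   func packMonthDayMillis(months, days, millis uint32) [12]byte {
   	var b [12]byte
   	binary.LittleEndian.PutUint32(b[0:4], months)
   	binary.LittleEndian.PutUint32(b[4:8], days)
   	binary.LittleEndian.PutUint32(b[8:12], millis)
   	return b
   }

   func main() {
   	// 1 month, 2 days, 3 seconds.
   	v := packMonthDayMillis(1, 2, 3000)
   	fmt.Printf("%x\n", v) // → 0100000002000000b80b0000
   }
   ```

   Since the storage already holds this layout, the writer can copy the bytes through unchanged and only the schema annotation differs.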

