[ 
https://issues.apache.org/jira/browse/ARROW-18288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634070#comment-17634070
 ] 

Matthew Topol commented on ARROW-18288:
---------------------------------------

This isn't quite as straightforward as just denormalizing the values if we 
want to ensure proper statistics handling and efficient propagation. (Sure, 
you could naively denormalize and then write, but that could cause unnecessary 
copies and other inefficient handling.) I've started working on this but ran 
into a couple of snags where I will need to use and enhance the compute 
package. I'll have a few things up for this soon as I work it out, since this 
is going to require:

* Enabling proper casting from Dictionary types to values (unpacking 
dictionaries)
* One of the following:
** Implementing hash kernels for the Compute module to efficiently perform a 
{{unique}} operation on the dictionary indexes to find the min/max for stats
** Implementing aggregation kernels so that MinMax can find the min/max on 
the dictionary array directly (more efficient than hashing for uniqueness, 
but harder and will take longer to do).

> [GO]: pqarrow (github.com/apache/arrow/go/v9/parquet/pqarrow) cannot handle 
> arrow's DICTIONARY field
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-18288
>                 URL: https://issues.apache.org/jira/browse/ARROW-18288
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Go, Parquet
>    Affects Versions: 9.0.0, 10.0.0
>            Reporter: Kevin Yang
>            Assignee: Matthew Topol
>            Priority: Minor
>
> Hey, Arrow Go Dev:
>  
> I was trying to save some arrow tables out to parquet files, with the help of 
> the 
> "[github.com/apache/arrow/go/v9/parquet/pqarrow|http://github.com/apache/arrow/go/v9/parquet/pqarrow]"
>  package. BTW, it's generally a great design (of Arrow) and a great Go 
> implementation. 
>  
> However, one issue sticks out: in my original arrow Table I have some 
> DICTIONARY fields, which pqarrow does NOT currently support.
>  
> I would assume supporting them will be quite straightforward: just "denormalize" 
> the DICTIONARY value into corresponding values (string, Timestamp, etc), and 
> it's up to the Parquet to do the right trick (using DICTIONARY encoding, 
> etc). 
>  
> I would have done this conversion on-the-fly by myself, by converting each 
> DICTIONARY field into underlying values. However, the arrow table schema is 
> dynamic and outside my control, and I need to iterate through fields (maybe 
> structs) to locate those, so it would be much better if pqarrow could 
> support this natively. 
>  
> Can anyone help? thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
