Re: [I] Mixing RLE_DICTIONARY and other column encodings in pyarrow parquet [arrow]

via GitHub Sat, 22 Mar 2025 05:33:10 -0700


ryancasburn-KAI commented on issue #43442:
URL: https://github.com/apache/arrow/issues/43442#issuecomment-2745248359


   @alexshpilkin I did find a workaround for that issue. `use_dictionary` can 
also take a list of column names, not just a Boolean. 
   
   So if you have:
   - ColumnA: RLE_DICTIONARY
   - ColumnB: DELTA_BINARY_PACKED
   - ColumnC: RLE_DICTIONARY
   - ColumnD: DELTA_BYTE_ARRAY 
   - ColumnE: RLE_DICTIONARY
   
   You can do:
   
   `pq.write_table(table, where, use_dictionary=["ColumnA", "ColumnC", "ColumnE"], column_encoding={"ColumnB": "DELTA_BINARY_PACKED", "ColumnD": "DELTA_BYTE_ARRAY"})`
   
   This works, but:
   1. It is kind of clunky. You now have to list all of your column names in 
your `write_table` call.
   2. It doesn’t have the flexibility of the fallback approach described in the 
C++ documentation I quoted above. You have to be confident that 
DELTA_BYTE_ARRAY (or whatever you select) is actually going to be better than 
dictionary encoding, and better for all row groups. 
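
One way to make point 1 a little less clunky is to keep a single column-to-encoding mapping and derive both arguments from it. This is only a sketch: `split_encodings` and `write_with_encodings` are hypothetical helper names, not part of pyarrow, and the sketch assumes every column marked `RLE_DICTIONARY` should go into `use_dictionary` while everything else goes into `column_encoding`. It does nothing about point 2, since there is still no fallback.

```python
def split_encodings(spec):
    """Split a {column: encoding} mapping into write_table keyword arguments.

    Hypothetical helper, not a pyarrow API. Columns marked RLE_DICTIONARY go
    into `use_dictionary`; all other columns go into `column_encoding`.
    """
    dict_cols = [col for col, enc in spec.items() if enc == "RLE_DICTIONARY"]
    other = {col: enc for col, enc in spec.items() if enc != "RLE_DICTIONARY"}
    return {"use_dictionary": dict_cols, "column_encoding": other}


def write_with_encodings(table, where, spec):
    """Write `table` to `where` using the per-column encodings in `spec`."""
    # Imported lazily so split_encodings can be used without pyarrow installed.
    import pyarrow.parquet as pq

    pq.write_table(table, where, **split_encodings(spec))


# The column layout from the example above, kept in one place.
spec = {
    "ColumnA": "RLE_DICTIONARY",
    "ColumnB": "DELTA_BINARY_PACKED",
    "ColumnC": "RLE_DICTIONARY",
    "ColumnD": "DELTA_BYTE_ARRAY",
    "ColumnE": "RLE_DICTIONARY",
}
kwargs = split_encodings(spec)
print(kwargs["use_dictionary"])  # ['ColumnA', 'ColumnC', 'ColumnE']
```

With a real table you would then call `write_with_encodings(table, where, spec)`, which should be equivalent to the one-liner above.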
   
   I think this area could use a rework to improve usability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
