wgtmac commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1382752637

   
   
   
   > I think it is a great refactor. Thanks a lot for working on it, @wgtmac! 
On the other hand, I've thought about PARQUET-2075 as a request for a new 
feature in `parquet-cli` that can be used to convert from one parquet file to 
another with specific configurations. (Later on we might extend it to allow 
multiple parquet files to be merged/rewritten to one specified and the tool 
would decide which level of deserialization/serialization is required.) I am 
fine with handling it in a separate jira but let's make it clear. Either create 
another jira for this refactor as a prerequisite of PARQUET-2075 or rephrase 
PARQUET-2075 and create a new one for `parquet-cli`. @shangxinli, what do you think?
   
   Thanks for your review, @gszadovszky.
   
   - I'd prefer creating a new JIRA for this refactor as a prerequisite. 
Merging multiple files into a single one with customized pruning, encryption, and 
codec is also on my mind and will be supported later. I will create separate 
JIRAs as sub-tasks of PARQUET-2075 and work on them progressively.
   - Putting the original `created_by` into `key_value_metadata` is a good 
idea. However, it is tricky if a file has been rewritten several times. 
What about adding a key named `original_created_by` to `key_value_metadata` and 
concatenating all old `created_by` values into it?
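
To make the `original_created_by` idea concrete, here is a minimal sketch of how the concatenation could work across repeated rewrites. It uses a plain `Map<String, String>` as a stand-in for `key_value_metadata` and does not touch any parquet-mr API. The class name `CreatedByHistory`, the method `appendCreatedBy`, and the newline separator are all assumptions made for illustration, not an agreed design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of parquet-mr.
class CreatedByHistory {
    // Key name taken from the proposal above; not an existing Parquet convention.
    static final String ORIGINAL_CREATED_BY = "original_created_by";
    // Assumed separator; any delimiter that cannot occur in created_by would do.
    static final String SEPARATOR = "\n";

    // Returns a copy of the metadata with oldCreatedBy appended to the history key.
    static Map<String, String> appendCreatedBy(Map<String, String> keyValueMetadata,
                                               String oldCreatedBy) {
        Map<String, String> merged = new LinkedHashMap<>(keyValueMetadata);
        if (oldCreatedBy == null || oldCreatedBy.isEmpty()) {
            return merged; // nothing to record
        }
        // First rewrite: the key is created. Later rewrites: the value is appended.
        merged.merge(ORIGINAL_CREATED_BY, oldCreatedBy,
                (history, latest) -> history + SEPARATOR + latest);
        return merged;
    }
}
```

Each rewrite would call `appendCreatedBy` with the input file's `created_by` before writing its own value into the footer, so the key keeps the writers in the order they touched the file.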
   

