wgtmac commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1382752637
> I think it is a great refactor. Thanks a lot for working on it, @wgtmac! On the other hand, I've thought of PARQUET-2075 as a request for a new feature in `parquet-cli` that can be used to convert one parquet file to another with specific configurations. (Later on we might extend it to allow multiple parquet files to be merged/rewritten into one specified file, with the tool deciding which level of deserialization/serialization is required.) I am fine with handling it in a separate jira, but let's make it clear. Either create another jira for this refactor as a prerequisite of PARQUET-2075, or rephrase PARQUET-2075 and create a new one for `parquet-cli`. @shangxinli, what do you think?

Thanks for your review @gszadovszky

- I'd prefer creating a new JIRA for this refactor as a prerequisite. Merging multiple files into a single one with customized pruning, encryption, and codec is also on my mind and will be supported later. I will create separate JIRAs as sub-tasks of PARQUET-2075 and work on them progressively.
- Putting the original `created_by` into `key_value_metadata` is a good idea. However, it gets tricky if a file has been rewritten several times. What about adding a key named `original_created_by` to `key_value_metadata` and concatenating all old `created_by`s to it?
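To make the `original_created_by` suggestion concrete, here is a minimal sketch of how the concatenation could work across repeated rewrites. It uses a plain `Map<String, String>` as a stand-in for `key_value_metadata` and does not call any parquet-mr API. The class name `CreatedByHistory`, the method `appendCreatedBy`, the newline separator, and the sample writer strings are all assumptions made for illustration, not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CreatedByHistory {
    // Key name taken from the suggestion in this comment; not an existing parquet-mr constant.
    public static final String ORIGINAL_CREATED_BY_KEY = "original_created_by";
    // Assumed separator between successive created_by values.
    public static final String SEPARATOR = "\n";

    /**
     * Returns a copy of the input file's key/value metadata in which the
     * input file's created_by string has been appended to the history key.
     */
    public static Map<String, String> appendCreatedBy(
            Map<String, String> keyValueMetadata, String oldCreatedBy) {
        Map<String, String> updated = new LinkedHashMap<>(keyValueMetadata);
        if (oldCreatedBy == null || oldCreatedBy.isEmpty()) {
            return updated; // nothing to record
        }
        // First rewrite: the key is absent, so the value is stored as-is.
        // Later rewrites: the new value is appended after the separator.
        updated.merge(ORIGINAL_CREATED_BY_KEY, oldCreatedBy,
                (history, latest) -> history + SEPARATOR + latest);
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> original = new LinkedHashMap<>();
        // Placeholder created_by strings, not real writer versions.
        Map<String, String> firstRewrite = appendCreatedBy(original, "example-writer version 1");
        Map<String, String> secondRewrite = appendCreatedBy(firstRewrite, "example-rewriter version 2");
        System.out.println(secondRewrite.get(ORIGINAL_CREATED_BY_KEY));
    }
}
```

Appending in order keeps the oldest writer first, so the full rewrite chain stays readable from a single key. A real implementation would also need a separator that cannot occur inside a `created_by` string.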