[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676996#comment-17676996
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-----------------------------------------

wgtmac commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383101451

   > > I am afraid some implementations may drop characters after `'\n'` when 
displaying the string content. Let me do some investigation.
   > 
   > I do not have a strong opinion for `'\n'` only that we need a character 
that probably won't be used by any systems writing parquet files.
   
   As we are discussing a new key value metadata entry 
(`original.created.by`), I need to raise two related issues that will come up 
once we support rewriting (merging) several files into one:
   - We would need to merge `original.created.by` from all input files, which 
makes it difficult to tell which `created_by` comes from which input file. 
Therefore, `original.created.by` should be dropped in this case.
   - Is there any key value metadata that may conflict across different input 
files and should be dealt with by the rewriter? For now we simply keep all the 
key value metadata from the old file.
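   
   To make the two points above concrete, here is a minimal sketch of one 
possible merge policy. It is not existing parquet-mr code: the class 
`KeyValueMetadataMerger`, its `merge` method, and the `Result` type are 
hypothetical names, and the conflict policy (drop the key and report it) is 
only an assumption for discussion. It also assumes metadata values are 
non-null.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for discussion only; not part of parquet-mr.
final class KeyValueMetadataMerger {
    // The key value metadata entry discussed in this thread.
    static final String ORIGINAL_CREATED_BY = "original.created.by";

    /** Merged entries plus the keys whose values disagreed across inputs. */
    static final class Result {
        final Map<String, String> merged = new LinkedHashMap<>();
        final Set<String> conflicts = new LinkedHashSet<>();
    }

    /**
     * Merges the key value metadata of all input files.
     * Assumed policy: with more than one input, original.created.by is
     * dropped; any other key with differing values is removed from the
     * merged map and reported in conflicts so the caller can decide.
     */
    static Result merge(List<Map<String, String>> inputs) {
        Result result = new Result();
        boolean multipleInputs = inputs.size() > 1;
        for (Map<String, String> input : inputs) {
            for (Map.Entry<String, String> entry : input.entrySet()) {
                String key = entry.getKey();
                if (multipleInputs && ORIGINAL_CREATED_BY.equals(key)) {
                    continue; // cannot be attributed to a single input file
                }
                if (result.conflicts.contains(key)) {
                    continue; // already known to conflict
                }
                String previous = result.merged.putIfAbsent(key, entry.getValue());
                if (previous != null && !previous.equals(entry.getValue())) {
                    result.merged.remove(key);
                    result.conflicts.add(key);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put(ORIGINAL_CREATED_BY, "writer-a");
        first.put("owner", "team-x");
        Map<String, String> second = new LinkedHashMap<>();
        second.put(ORIGINAL_CREATED_BY, "writer-b");
        second.put("owner", "team-y");

        Result result = merge(List.of(first, second));
        System.out.println("merged=" + result.merged);
        System.out.println("conflicts=" + result.conflicts);
    }
}
```

   With a single input file the sketch keeps `original.created.by` as is, 
since there is no ambiguity about where it came from. Whether conflicting keys 
should be dropped, concatenated, or cause a failure is exactly the open 
question.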
   
   @gszadovszky @ggershinsky @shangxinli Thoughts?
   
   If this behavior requires further discussion, I'd suggest keeping the 
current handling of `created_by` unchanged in this pull request, which is 
already large enough. All existing rewriters (ColumnPruner, 
CompressionConverter, ColumnMasker, and ColumnEncrypter) drop the original 
`created_by` and store the current writer version in the footer.
   
   




> Refactor different file rewriters to use single implementation
> --------------------------------------------------------------
>
>                 Key: PARQUET-2227
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2227
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-mr
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> A new ParquetRewriter is implemented to support all of the logic in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. All 
> the old rewriters are then refactored to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
