[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681825#comment-17681825
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

wgtmac commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1407856755

   Thank you @gszadovszky @ggershinsky @shangxinli




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681802#comment-17681802
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

gszadovszky merged PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681762#comment-17681762
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

shangxinli commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1407713001

   Thanks @wgtmac for working on this and Thanks @gszadovszky and @ggershinsky 
for reviewing it! I am a little late for the comments discussion but I see we 
are in the right direction. Let's address it in a separate discussion. If it 
turns out that changing the parquet-format is the right way to solve it, we can 
make the proposal and I can help for the approval. 
   
   My comments are all addressed. I don't have further comments. 




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681730#comment-17681730
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

gszadovszky commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1407653245

   Thank you, @wgtmac for working on this! It looks good to me.
   Since there were other reviewers, I would wait a bit to give a chance to 
them for add feedback. I'll push it after a couple of days. (Feel free to ping 
me if I forget.)




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681716#comment-17681716
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

wgtmac commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1407613955

   Gentle ping. @gszadovszky @ggershinsky @shangxinli
   
   Any chance to take another look?




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677039#comment-17677039
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

wgtmac commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383160117

   > I agree that merging the key-value metadata is not an easy question. Let's 
discuss it separately as it is not related to this PR.
   > 
   > I also agree to store the current writer (parquet-mr) in `created_by` in 
case of rewriting. It is not easy to decide what would be the proper solution 
anyway. `created_by` is usually used for handling potential erroneous writes. 
Let's say there was an issue in parquet-mr at the version 1.2.3 that written a 
specific encoding of integers wrongly (not according to spec). What if we 
rewrite the file but do not re-encode the pages? Can we still handle the 
original issue? What if the rewriter re-encodes the related pages? Let's store 
the original writer in `original.created.by` for now. Let's discuss this topic 
separately however, I am not sure if we can find a proper solution.
   
   I agree. Now I have updated this PR to preserve the old writer version into 
`original.created.by` and added a test to make sure it is preserved. Please 
take a look. Thanks!




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677036#comment-17677036
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

gszadovszky commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383152341

   I agree that merging the key-value metadata is not an easy question. Let's 
discuss it separately as it is not related to this PR.
   
   I also agree to store the current writer (parquet-mr) in `created_by` in 
case of rewriting.  It is not easy to decide what would be the proper solution 
anyway. `created_by` is usually used for handling potential erroneous writes. 
Let's say there was an issue in parquet-mr at the version 1.2.3 that written a 
specific encoding of integers wrongly (not according to spec). What if we 
rewrite the file but do not re-encode the pages? Can we still handle the 
original issue? What if the rewriter re-encodes the related pages?
   Let's store the original writer in `original.created.by` for now. Let's 
discuss this topic separately however, I am not sure if we can find a proper 
solution.




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17676996#comment-17676996
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

wgtmac commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383101451

   > > I am afraid some implementations may drop characters after `'\n'` when 
displaying the string content. Let me do some investigation.
   > 
   > I do not have a strong opinion for `'\n'` only that we need a character 
that probably won't be used by any systems writing parquet files.
   
   As we are discussing a new entry (`original.created.by`) to the key value 
metadata, I need to raise two related issues once we have supported rewriting 
(merging) several files into one:
   - We need to merge `original.created.by` from all input files, making it 
difficult to tell which created_by comes from which input file. Therefore, 
`original.created.by` should be dropped in this case.
   - Is there any key value metadata that will conflict from different input 
files and should be dealt with by the rewriter? For now we simply keep all the 
old key value metadata from the old file.
   
   @gszadovszky @ggershinsky @shangxinli Thoughts?
   
   If this behavior requires further discussion, I'd suggest to keep the 
current state of `created_by` unchanged in this pull request which is large 
enough. All rewriters (ColumnPruner, CompressionConverter, ColumnMasker, and 
ColumnEncrypter) have dropped original `created_by` and store the current 
writer version to the footer.
   
   




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2227) Refactor different file rewriters to use single implementation

2023-01-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17676927#comment-17676927
 ] 

ASF GitHub Bot commented on PARQUET-2227:
-

gszadovszky commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1382840526

   > I am afraid some implementations may drop characters after `'\n'` when 
displaying the string content. Let me do some investigation.
   
   I do not have a strong opinion for `'\n'` only that we need a character that 
probably won't be used by any systems writing parquet files.




> Refactor different file rewriters to use single implementation
> --
>
> Key: PARQUET-2227
> URL: https://issues.apache.org/jira/browse/PARQUET-2227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> A new ParquetRewriter is implemented to support all logics in the 
> ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter. And 
> refactor all the old rewriters to use ParquetRewriter under the hood.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)