[jira] [Commented] (PARQUET-1666) Remove Unused Modules

2020-12-02 Thread Julien Le Dem (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242864#comment-17242864
 ] 

Julien Le Dem commented on PARQUET-1666:


that sounds good to me too

> Remove Unused Modules 
> --
>
> Key: PARQUET-1666
> URL: https://issues.apache.org/jira/browse/PARQUET-1666
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> In the last two meetings, Ryan Blue proposed removing some unused Parquet 
> modules. This task tracks that work. 
> Here are the related meeting notes from the discussion. 
> Remove old Parquet modules
> Hive modules - sounds good
> Scrooge - Julien will reach out to Twitter
> Tools - undecided - Cloudera may still use parquet-tools, according to 
> Gabor.
> Cascading - undecided
> We can mark the modules as deprecated in their descriptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1801) Add column index support for 'prune' command in Parquet-tools/cli

2020-12-02 Thread Pavi Subenderan (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242818#comment-17242818
 ] 

Pavi Subenderan commented on PARQUET-1801:
--

[~gszadovszky] I am helping Xinli with bringing prune to parquet-cli. I'm 
expecting to open a PR by Monday. Hope this fits into the schedule for 1.12.0.

> Add column index support for 'prune' command in Parquet-tools/cli
> -
>
> Key: PARQUET-1801
> URL: https://issues.apache.org/jira/browse/PARQUET-1801
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>






[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex

2020-12-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242634#comment-17242634
 ] 

Xinli Shang commented on PARQUET-1901:
--

For now, I think we can move it to the next release. 

> Add filter null check for ColumnIndex  
> ---
>
> Key: PARQUET-1901
> URL: https://issues.apache.org/jira/browse/PARQUET-1901
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> This Jira is opened to discuss whether we should add a null check for the 
> filter when ColumnIndex is enabled. 
> In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 
> 'filter' is assumed to be non-null without checking. It throws an NPE when 
> ColumnIndex is enabled (the default) but no filter is set in the 
> ParquetReadOptions. The call stack is as below. 
> java.lang.NullPointerException
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
>   at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961)
>   at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891)
> If we don't add the check, the user has to choose between readNextRowGroup() and 
> readNextFilteredRowGroup() depending on whether a filter exists. 
> Thoughts?
>   
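
The caller-side choice described in this issue can be sketched as follows. This is only an illustration, not the parquet-mr API: `RowGroupSource` is a hypothetical stand-in for the two ParquetFileReader methods named above, `NullSafeRowGroupReads` is an invented helper, and the `Object` parameter stands in for whatever filter may or may not be set in ParquetReadOptions.

```java
// Hypothetical stand-in for the two read methods named in the issue.
interface RowGroupSource {
    String readNextRowGroup();          // unfiltered read
    String readNextFilteredRowGroup();  // read that consults the column index
}

final class NullSafeRowGroupReads {
    private NullSafeRowGroupReads() {}

    // Take the filtered path only when a filter is actually present, so the
    // column-index code never sees a null filter.
    static String readNext(RowGroupSource source, Object filter) {
        return filter == null
                ? source.readNextRowGroup()
                : source.readNextFilteredRowGroup();
    }
}
```

A null check inside the library would make this branching unnecessary for every caller, which is the trade-off the issue asks about.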





[jira] [Comment Edited] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-12-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242631#comment-17242631
 ] 

Xinli Shang edited comment on PARQUET-1927 at 12/2/20, 7:05 PM:


This was not decided in the last Iceberg meeting. But I think if adding 
the 'skipped number of records' is a minimal change for us, we can go ahead and add 
it. Otherwise, we can release without it. 

Adding [~rdblue] as an FYI.


was (Author: sha...@uber.com):
This was not decided in the last Iceberg meeting. But I think if adding 
the 'skipped number of records' is a minimal change for us, we can go ahead and add 
it. Otherwise, we can release without it. 

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found that we need to know from Parquet 
> how many records were skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advances to the next row group without 
> telling the caller. See the code here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator. Its hasNext() contains the 
> following code:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]).
>  
> So without knowing the number of skipped values, it is hard to determine what 
> hasNext() should return. 
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider the whole file done. Then hasNext() just 
> returns false. 
>  
>  
>  
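
The flag workaround described in this issue could look roughly like the sketch below. It is an illustration under stated assumptions, not Iceberg's or parquet-mr's code: `FilteredRowGroupIterator` is a hypothetical class, it iterates at row-group rather than record granularity, and the `Supplier` stands in for a call to readNextFilteredRowGroup() that returns null once the file is exhausted.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical iterator that relies on a "done" flag instead of counting
// skipped values.
final class FilteredRowGroupIterator<T> implements Iterator<T> {
    private final Supplier<T> nextGroup; // stand-in for readNextFilteredRowGroup()
    private T pending;                   // row group fetched but not yet returned
    private boolean done;                // set once the supplier has returned null

    FilteredRowGroupIterator(Supplier<T> nextGroup) {
        this.nextGroup = nextGroup;
    }

    @Override
    public boolean hasNext() {
        if (pending == null && !done) {
            pending = nextGroup.get();
            done = (pending == null); // null means the whole file is finished
        }
        return pending != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = pending;
        pending = null;
        return result;
    }
}
```

With a skipped-record count exposed by Parquet, hasNext() could instead keep the arithmetic check quoted above and would not need to fetch ahead.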





[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-12-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242631#comment-17242631
 ] 

Xinli Shang commented on PARQUET-1927:
--

This was not decided in the last Iceberg meeting. But I think if adding 
the 'skipped number of records' is a minimal change for us, we can go ahead and add 
it. Otherwise, we can release without it. 



[jira] [Commented] (PARQUET-1666) Remove Unused Modules

2020-12-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242625#comment-17242625
 ] 

Xinli Shang commented on PARQUET-1666:
--

I think adding "-deprecated" is a good idea. 

[~zhenxiao], can you let us know whether dropping the parquet-scrooge module from the 
parquet-mr repo is OK for Twitter's usage? 



[jira] [Commented] (PARQUET-1666) Remove Unused Modules

2020-12-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242547#comment-17242547
 ] 

Gabor Szadovszky commented on PARQUET-1666:
---

[~julienledem], [~sha...@uber.com], what is the status of this? Do we want to 
drop Hive modules in 1.12.0? What about Scrooge?

I think the deprecation notice in the module description is not enough; users 
won't notice it. A better way would be to add a "-deprecated" suffix to 
the artifact name so that users have to rename their dependencies when upgrading 
Parquet. What do you think?



[jira] [Commented] (PARQUET-1676) Remove hive modules

2020-12-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242545#comment-17242545
 ] 

Gabor Szadovszky commented on PARQUET-1676:
---

[~fokko], any updates on this?

> Remove hive modules
> ---
>
> Key: PARQUET-1676
> URL: https://issues.apache.org/jira/browse/PARQUET-1676
> Project: Parquet
>  Issue Type: Task
>Affects Versions: 1.10.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Remove the Hive modules as discussed in the Parquet sync.





[jira] [Commented] (PARQUET-1677) Bump Apache Pig from 0.16.0 to 0.17.0

2020-12-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242542#comment-17242542
 ] 

Gabor Szadovszky commented on PARQUET-1677:
---

[~fokko], do you want to pick this up and finalize for 1.12.0?

> Bump Apache Pig from 0.16.0 to 0.17.0
> -
>
> Key: PARQUET-1677
> URL: https://issues.apache.org/jira/browse/PARQUET-1677
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-pig
>Affects Versions: 1.11.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>






[jira] [Commented] (PARQUET-1942) Bump Apache Arrow 2.0.0

2020-12-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242540#comment-17242540
 ] 

Gabor Szadovszky commented on PARQUET-1942:
---

[~fokko], do you want to work on it for 1.12.0? We would like to start the 
release process for 1.12.0 soon.

> Bump Apache Arrow 2.0.0
> ---
>
> Key: PARQUET-1942
> URL: https://issues.apache.org/jira/browse/PARQUET-1942
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.11.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.12.0
>
>






[jira] [Resolved] (PARQUET-1941) Bump Commons CLI from 1.3.1 to 1.4

2020-12-02 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1941.
---
Resolution: Fixed

> Bump Commons CLI from 1.3.1 to 1.4
> --
>
> Key: PARQUET-1941
> URL: https://issues.apache.org/jira/browse/PARQUET-1941
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Affects Versions: 1.11.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.12.0
>
>






[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-12-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242538#comment-17242538
 ] 

Gabor Szadovszky commented on PARQUET-1927:
---

What is the current status of this one? Is it a blocker for 1.12.0?



[jira] [Commented] (PARQUET-1801) Add column index support for 'prune' command in Parquet-tools/cli

2020-12-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242537#comment-17242537
 ] 

Gabor Szadovszky commented on PARQUET-1801:
---

Since prune is already implemented in parquet-tools, it would be nice to have it 
in parquet-cli as well. Are you able to do this for 1.12.0?



[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex

2020-12-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242535#comment-17242535
 ] 

Gabor Szadovszky commented on PARQUET-1901:
---

Still want to work on this for 1.12.0?



[jira] [Updated] (PARQUET-1800) Add 'prune' command to parquet-cli

2020-12-02 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1800:
--
Fix Version/s: (was: 1.12.0)

Removed the target 1.12.0

> Add 'prune' command to parquet-cli
> --
>
> Key: PARQUET-1800
> URL: https://issues.apache.org/jira/browse/PARQUET-1800
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>






[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-12-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242521#comment-17242521
 ] 

Gabor Szadovszky commented on PARQUET-1792:
---

Removed the target 1.12.0.

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Some personal data columns need to be masked instead of being 
> pruned (PARQUET-1791). We need a tool to replace the raw data columns with 
> masked values. The masked value could be a hash, null, a redacted value, etc. 
> Unchanged columns should be copied as a whole, as the 'merge' and 'prune' 
> commands in parquet-tools do. 
>  
> Implementing this feature at the file-format level is 10X faster than 
> rewriting the table data in the query engine. 
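
The three masking options mentioned above (hash, null, redact) can be illustrated with a small value-level sketch. Everything here is hypothetical: `MaskMode` and `ColumnMasker` are invented names, SHA-256 and the `***` placeholder are arbitrary choices, and the sketch says nothing about how a real 'mask' command would rewrite column chunks or copy unchanged columns.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical masking modes matching the options listed in the issue.
enum MaskMode { HASH, NULLIFY, REDACT }

final class ColumnMasker {
    private ColumnMasker() {}

    // Returns the masked form of a single string value.
    static String mask(String value, MaskMode mode) {
        if (value == null) {
            return null; // nothing to mask
        }
        switch (mode) {
            case HASH:
                return sha256Hex(value);
            case NULLIFY:
                return null;
            case REDACT:
                return "***"; // assumed placeholder
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    // Lower-case hex encoding of the SHA-256 digest of the UTF-8 bytes.
    private static String sha256Hex(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```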





[jira] [Updated] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-12-02 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1792:
--
Fix Version/s: (was: 1.12.0)



[jira] [Resolved] (PARQUET-1714) Release parquet format 2.8.0

2020-12-02 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1714.
---
Resolution: Fixed

Parquet format 2.8.0 has already been released; I just missed closing this 
one.

> Release parquet format 2.8.0
> 
>
> Key: PARQUET-1714
> URL: https://issues.apache.org/jira/browse/PARQUET-1714
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Affects Versions: format-2.8.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.8.0
>
>






[jira] [Commented] (PARQUET-1928) Interpret Parquet INT96 type as FIXED[12] AVRO Schema

2020-12-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242302#comment-17242302
 ] 

ASF GitHub Bot commented on PARQUET-1928:
-

anantdamle commented on pull request #831:
URL: https://github.com/apache/parquet-mr/pull/831#issuecomment-737194101


   Gentle bump up 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Interpret Parquet INT96 type as FIXED[12] AVRO Schema
> -
>
> Key: PARQUET-1928
> URL: https://issues.apache.org/jira/browse/PARQUET-1928
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.11.0
>Reporter: Anant Damle
>Priority: Minor
>  Labels: patch
> Fix For: 1.12.0
>
>
> Reading Parquet files in Apache Beam with ParquetIO goes through `AvroParquetReader`, 
> which throws `IllegalArgumentException("INT96 not implemented and is 
> deprecated")`.
> Customers have large datasets that can't be reprocessed to convert them 
> into a supported type. An easier approach would be to convert the value into a 
> 12-byte array that the developer can then interpret in any way 
> they want.
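
As one example of how a developer might interpret such a 12-byte value, the sketch below decodes it as a timestamp. It assumes the layout commonly written by Impala and Hive (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian); the issue itself does not prescribe any interpretation, and `Int96Timestamps` is a hypothetical helper, not part of parquet-avro.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

final class Int96Timestamps {
    // Julian day number of 1970-01-01, the Unix epoch.
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;
    private static final long SECONDS_PER_DAY = 86_400L;

    private Int96Timestamps() {}

    // Decodes a 12-byte FIXED value under the assumed layout:
    // bytes 0-7 = nanoseconds of day, bytes 8-11 = Julian day, little-endian.
    static Instant toInstant(byte[] fixed12) {
        if (fixed12 == null || fixed12.length != 12) {
            throw new IllegalArgumentException("expected exactly 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(fixed12).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        long julianDay = buf.getInt();
        long epochSeconds = (julianDay - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY;
        return Instant.ofEpochSecond(epochSeconds, nanosOfDay);
    }
}
```

Data written with a different convention would need a different decoder, which is why exposing the raw bytes leaves the choice to the reader.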





[GitHub] [parquet-mr] anantdamle commented on pull request #831: PARQUET-1928: Interpret Parquet INT96 type as FIXED[12] AVRO Schema

2020-12-02 Thread GitBox


anantdamle commented on pull request #831:
URL: https://github.com/apache/parquet-mr/pull/831#issuecomment-737194101


   Gentle bump up 







[jira] [Commented] (PARQUET-1851) ParquetMetadataConveter throws NPE in an Iceberg unit test

2020-12-02 Thread Junjie Chen (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242217#comment-17242217
 ] 

Junjie Chen commented on PARQUET-1851:
--

Even if the client doesn't write any data, we should not throw an NPE.

> ParquetMetadataConveter throws NPE in an Iceberg unit test
> --
>
> Key: PARQUET-1851
> URL: https://issues.apache.org/jira/browse/PARQUET-1851
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Junjie Chen
>Priority: Major
>
> When writing data to parquet in an Iceberg unit test, it throws NPE as below
> {code:java}
> java.lang.NullPointerException
>   at org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:476)
>   at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:177)
>   at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:914)
>   at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:864)
>   at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:206)
>   at org.apache.iceberg.data.TestLocalScan.writeFile(TestLocalScan.java:429)
> {code}
>  





[jira] [Reopened] (PARQUET-1851) ParquetMetadataConveter throws NPE in an Iceberg unit test

2020-12-02 Thread Junjie Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junjie Chen reopened PARQUET-1851:
--
