[jira] [Commented] (PARQUET-1801) Add column index support for 'prune' command in Parquet-tools/cli

2020-08-13 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176961#comment-17176961
 ] 

Xinli Shang commented on PARQUET-1801:
--

I will try to do it in 1.12.0.  

The feature works great! We removed columns from many whale tables and 
saved significant storage space. I will give a talk at ApacheCon 2020 to 
present this topic.  

> Add column index support for 'prune' command in Parquet-tools/cli
> -
>
> Key: PARQUET-1801
> URL: https://issues.apache.org/jira/browse/PARQUET-1801
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-08-13 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176959#comment-17176959
 ] 

Xinli Shang commented on PARQUET-1792:
--

We might want to push it to the next release. 

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being 
> pruned (PARQUET-1791). We need a tool to replace the raw data columns with 
> masked values. The masked value could be a hash, null, a redacted value, 
> etc. Unchanged columns should be moved as a whole, as the 'merge' and 
> 'prune' commands in parquet-tools do. 
>  
> Implementing this feature at the file-format level is 10X faster than 
> rewriting the table data in the query engine. 
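
The three mask modes named in the description above (hash, null, redact) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration only: the names `mask_value` and `mask_rows`, the choice of SHA-256 for hashing, and the use of asterisks for redaction do not come from the issue. The sketch also works on in-memory rows, so it shows only the masking semantics, not the file-format-level rewrite of column chunks that the issue proposes.

```python
import hashlib
from typing import Dict, List, Optional


def mask_value(value: Optional[str], mode: str) -> Optional[str]:
    """Return a masked form of one string value (hypothetical helper)."""
    if mode == "null":
        # Drop the value entirely.
        return None
    if value is None:
        # Nothing to mask; keep nulls as nulls.
        return None
    if mode == "hash":
        # Assumption: SHA-256 hex digest; the issue does not name a hash.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if mode == "redact":
        # Assumption: replace every character with an asterisk.
        return "*" * len(value)
    raise ValueError(f"unknown mask mode: {mode}")


def mask_rows(
    rows: List[Dict[str, Optional[str]]],
    masked_columns: Dict[str, str],
) -> List[Dict[str, Optional[str]]]:
    """Mask the listed columns and pass all other columns through unchanged."""
    return [
        {
            col: mask_value(val, masked_columns[col]) if col in masked_columns else val
            for col, val in row.items()
        }
        for row in rows
    ]
```

For example, `mask_rows([{"id": "1", "email": "a@b.c"}], {"email": "redact"})` leaves `id` untouched and replaces the `email` value with asterisks of the same length.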





[jira] [Commented] (PARQUET-1895) Update jackson-databind

2020-08-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176935#comment-17176935
 ] 

ASF GitHub Bot commented on PARQUET-1895:
-

gszadovszky opened a new pull request #811:
URL: https://github.com/apache/parquet-mr/pull/811


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explains what they do
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Update jackson-databind
> ---
>
> Key: PARQUET-1895
> URL: https://issues.apache.org/jira/browse/PARQUET-1895
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Patrick OFriel
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.12.0
>
>
> The jackson databind 2.9.10.4 has the following CVEs:
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14060]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14061]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14062]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14195]
> They should be resolved if we update to 2.9.10.5





[GitHub] [parquet-mr] gszadovszky opened a new pull request #811: PARQUET-1895: Update jackson-databind

2020-08-13 Thread GitBox


gszadovszky opened a new pull request #811:
URL: https://github.com/apache/parquet-mr/pull/811









[jira] [Updated] (PARQUET-1895) Update jackson-databind

2020-08-13 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1895:
--
Summary: Update jackson-databind  (was: Update jackson-databind to 2.9.10.5)

> Update jackson-databind
> ---
>
> Key: PARQUET-1895
> URL: https://issues.apache.org/jira/browse/PARQUET-1895
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Patrick OFriel
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.12.0
>
>
> The jackson databind 2.9.10.4 has the following CVEs:
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14060]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14061]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14062]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14195]
> They should be resolved if we update to 2.9.10.5





[jira] [Assigned] (PARQUET-1895) Update jackson-databind to 2.9.10.5

2020-08-13 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1895:
-

Assignee: Gabor Szadovszky

> Update jackson-databind to 2.9.10.5
> ---
>
> Key: PARQUET-1895
> URL: https://issues.apache.org/jira/browse/PARQUET-1895
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Patrick OFriel
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.12.0
>
>
> The jackson databind 2.9.10.4 has the following CVEs:
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14060]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14061]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14062]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14195]
> They should be resolved if we update to 2.9.10.5





[jira] [Commented] (PARQUET-1895) Update jackson-databind to 2.9.10.5

2020-08-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176905#comment-17176905
 ] 

Gabor Szadovszky commented on PARQUET-1895:
---

2.11.2 is later (in terms of release date) than 2.9.10.5. So, I guess it would 
be a better choice.

> Update jackson-databind to 2.9.10.5
> ---
>
> Key: PARQUET-1895
> URL: https://issues.apache.org/jira/browse/PARQUET-1895
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Patrick OFriel
>Priority: Major
> Fix For: 1.12.0
>
>
> The jackson databind 2.9.10.4 has the following CVEs:
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14060]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14061]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14062]
> [https://nvd.nist.gov/vuln/detail/CVE-2020-14195]
> They should be resolved if we update to 2.9.10.5





[GitHub] [parquet-mr] mauliksoneji opened a new pull request #810: nexus-setup

2020-08-13 Thread GitBox


mauliksoneji opened a new pull request #810:
URL: https://github.com/apache/parquet-mr/pull/810


   







[GitHub] [parquet-mr] mauliksoneji closed pull request #810: nexus-setup

2020-08-13 Thread GitBox


mauliksoneji closed pull request #810:
URL: https://github.com/apache/parquet-mr/pull/810


   







[jira] [Commented] (PARQUET-1676) Remove hive modules

2020-08-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176889#comment-17176889
 ] 

Gabor Szadovszky commented on PARQUET-1676:
---

[~fokko], do you want to pick this up for 1.12.0?

> Remove hive modules
> ---
>
> Key: PARQUET-1676
> URL: https://issues.apache.org/jira/browse/PARQUET-1676
> Project: Parquet
>  Issue Type: Task
>Affects Versions: 1.10.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Remove the Hive modules as discussed in the Parquet sync.





[jira] [Commented] (PARQUET-1666) Remove Unused Modules

2020-08-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176888#comment-17176888
 ] 

Gabor Szadovszky commented on PARQUET-1666:
---

I think we are good to remove the Hive modules for 1.12.0.

[~julienledem], do you have any feedback about Scrooge?

What would be the next steps for the others? I think we can deprecate them in 
1.12.0 and remove them later.

> Remove Unused Modules 
> --
>
> Key: PARQUET-1666
> URL: https://issues.apache.org/jira/browse/PARQUET-1666
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> In the last two meetings, Ryan Blue proposed removing some unused Parquet 
> modules. This task tracks that work. 
> Here are the related meeting notes from the discussion: 
> Remove old Parquet modules
> - Hive modules: sounds good
> - Scrooge: Julien will reach out to Twitter
> - Tools: undecided; Cloudera may still use parquet-tools according to 
>   Gabor
> - Cascading: undecided
> We can mark the modules as deprecated in their descriptions.





[jira] [Commented] (PARQUET-1800) Add 'prune' command to parquet-cli

2020-08-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176884#comment-17176884
 ] 

Gabor Szadovszky commented on PARQUET-1800:
---

[~sha...@uber.com], do you want to do that for 1.12.0?

> Add 'prune' command to parquet-cli
> --
>
> Key: PARQUET-1800
> URL: https://issues.apache.org/jira/browse/PARQUET-1800
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>






[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-08-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176883#comment-17176883
 ] 

Gabor Szadovszky commented on PARQUET-1792:
---

[~sha...@uber.com], is this still targeted for 1.12.0?

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being 
> pruned (PARQUET-1791). We need a tool to replace the raw data columns with 
> masked values. The masked value could be a hash, null, a redacted value, 
> etc. Unchanged columns should be moved as a whole, as the 'merge' and 
> 'prune' commands in parquet-tools do. 
>  
> Implementing this feature at the file-format level is 10X faster than 
> rewriting the table data in the query engine. 





[jira] [Commented] (PARQUET-1801) Add column index support for 'prune' command in Parquet-tools/cli

2020-08-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176882#comment-17176882
 ] 

Gabor Szadovszky commented on PARQUET-1801:
---

[~sha...@uber.com], do you think you can work on this for 1.12.0? How does the 
prune feature work currently? Is it a must-have for 1.12.0?

> Add column index support for 'prune' command in Parquet-tools/cli
> -
>
> Key: PARQUET-1801
> URL: https://issues.apache.org/jira/browse/PARQUET-1801
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[DISCUSS] Content of parquet-mr 1.12.0

2020-08-13 Thread Gabor Szadovszky
Hi everyone,

We have two big changes (the bloom filters and the column encryption) that
are already in master. I think it is time for a minor release to make these
great features available for the users.

I've created the jira PARQUET-1898 to track the release.
Please link any open jiras to the release jira as a blocker so we won't
miss anything.

Thanks a lot,
Gabor


[jira] [Created] (PARQUET-1898) Release parquet-mr 1.12.0

2020-08-13 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1898:
-

 Summary: Release parquet-mr 1.12.0
 Key: PARQUET-1898
 URL: https://issues.apache.org/jira/browse/PARQUET-1898
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky
 Fix For: 1.12.0


This task is to track the parquet-mr release 1.12.0.

Any open Jira that we would like to be added to this release shall be linked to 
this one as a blocker.


