[jira] [Updated] (PARQUET-1373) Encryption key management tools

2020-09-23 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1373:
--
Fix Version/s: (was: encryption-feature-branch)
   1.12.0
Affects Version/s: 1.12.0

> Encryption key management tools 
> 
>
> Key: PARQUET-1373
> URL: https://issues.apache.org/jira/browse/PARQUET-1373
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.0
>
>
> Parquet Modular Encryption 
> ([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides 
> an API that accepts keys, arbitrary key metadata and key retrieval callbacks, 
> which makes it possible to implement basically any key management policy on 
> top of it. This Jira will add tools that implement a set of best-practice 
> elements for key management. This is not an end-to-end key management 
> solution, but rather a set of components that might simplify the design and 
> development of one.
> This tool set is one of many possible. There is no goal to create a single or 
> “standard” toolkit for Parquet encryption keys. Parquet has a Crypto Factory 
> interface ([PARQUET-1817|https://issues.apache.org/jira/browse/PARQUET-1817]) 
> that allows plugging in different implementations of encryption key management.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1854) Properties-Driven Interface to Parquet Encryption

2020-09-23 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1854:
--
  Component/s: parquet-mr
Fix Version/s: (was: encryption-feature-branch)
   1.12.0
Affects Version/s: 1.12.0

> Properties-Driven Interface to Parquet Encryption
> -
>
> Key: PARQUET-1854
> URL: https://issues.apache.org/jira/browse/PARQUET-1854
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.0
>
>
> A high-level interface to the Parquet encryption layer, based on configuration 
> properties (table properties, Hadoop configuration, writer/reader options, 
> etc.), will simplify the activation and configuration of data encryption.





[jira] [Updated] (PARQUET-1178) Parquet modular encryption

2020-09-23 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1178:
--
  Component/s: parquet-mr
Affects Version/s: 1.12.0

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> data to be kept fully encrypted in storage while enabling efficient analytics 
> on it, via reader-side extraction, authentication and decryption of the data 
> subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.
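
The idea of encrypting different columns with different keys can be illustrated with a minimal sketch. It uses plain JDK AES-GCM (`javax.crypto`) on in-memory byte arrays and is not the Parquet encryption API or file layout; the class `ColumnKeySketch`, its methods and the column names are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration only: one AES-GCM key per column, no Parquet classes involved.
final class ColumnKeySketch {
    private static final int IV_LEN = 12;     // 96-bit nonce, the usual size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates a random 128-bit AES key.
    static byte[] newKey() {
        byte[] key = new byte[16];
        RANDOM.nextBytes(key);
        return key;
    }

    // Encrypts column bytes; the result is nonce || ciphertext || tag.
    static byte[] encrypt(byte[] key, byte[] plain) throws GeneralSecurityException {
        byte[] iv = new byte[IV_LEN];
        RANDOM.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = cipher.doFinal(plain);
        byte[] out = Arrays.copyOf(iv, IV_LEN + ct.length);
        System.arraycopy(ct, 0, out, IV_LEN, ct.length);
        return out;
    }

    // Decrypts and authenticates; throws if the key is wrong or the data was tampered with.
    static byte[] decrypt(byte[] key, byte[] blob) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                new GCMParameterSpec(TAG_BITS, blob, 0, IV_LEN));
        return cipher.doFinal(blob, IV_LEN, blob.length - IV_LEN);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        byte[] nameKey = newKey();    // handed to every reader
        byte[] salaryKey = newKey();  // handed only to privileged readers

        byte[] nameColumn = encrypt(nameKey, "alice,bob".getBytes(StandardCharsets.UTF_8));
        byte[] salaryColumn = encrypt(salaryKey, "100,200".getBytes(StandardCharsets.UTF_8));

        // A reader holding only nameKey can read the name column...
        System.out.println(new String(decrypt(nameKey, nameColumn), StandardCharsets.UTF_8));

        // ...but cannot read the salary column with it.
        try {
            decrypt(nameKey, salaryColumn);
            System.out.println("unexpected: salary column decrypted");
        } catch (GeneralSecurityException e) {
            System.out.println("salary column rejected for this reader");
        }
    }
}
```

Because each column has its own key, access control reduces to deciding which keys a given reader receives; GCM's authentication tag is what makes a wrong key fail loudly instead of returning garbage.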





[jira] [Commented] (PARQUET-1915) Add null command

2020-09-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201243#comment-17201243
 ] 

ASF GitHub Bot commented on PARQUET-1915:
-

shangxinli opened a new pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add null command 
> -
>
> Key: PARQUET-1915
> URL: https://issues.apache.org/jira/browse/PARQUET-1915
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>






[GitHub] [parquet-mr] shangxinli opened a new pull request #819: PARQUET-1915: Add nullify column

2020-09-23 Thread GitBox


shangxinli opened a new pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819






[jira] [Created] (PARQUET-1916) Add hash functionality

2020-09-23 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1916:


 Summary: Add hash functionality 
 Key: PARQUET-1916
 URL: https://issues.apache.org/jira/browse/PARQUET-1916
 Project: Parquet
  Issue Type: Sub-task
Reporter: Xinli Shang








[jira] [Assigned] (PARQUET-1915) Add null command

2020-09-23 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang reassigned PARQUET-1915:


Assignee: Xinli Shang

> Add null command 
> -
>
> Key: PARQUET-1915
> URL: https://issues.apache.org/jira/browse/PARQUET-1915
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>






[jira] [Created] (PARQUET-1915) Add null command

2020-09-23 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1915:


 Summary: Add null command 
 Key: PARQUET-1915
 URL: https://issues.apache.org/jira/browse/PARQUET-1915
 Project: Parquet
  Issue Type: Sub-task
Reporter: Xinli Shang








Re: Metadata summary file deprecation

2020-09-23 Thread Jacques Nadeau
Hey Jason,

I'd suggest you look at Apache Iceberg. It is a much more mature way of
handling metadata efficiency issues and provides a substantial superset of
functionality over the old metadata cache files.



Re: Metadata summary file deprecation

2020-09-23 Thread Jason Altekruse
Hello again,

I took a look through the mail archives and found a little more information
in this and a few other threads.

http://mail-archives.apache.org/mod_mbox//parquet-dev/201707.mbox/%3CCAO4re1k8-bZZZWBRuLCnm1V7AoJE1fdunSuBn%2BecRuFGPgcXnA%40mail.gmail.com%3E

While I do understand the benefits of federating out the reading of
footers, so as not to worry about synchronization between the
cached metadata and any changes to the files on disk, there
is still a use case that this design doesn't solve well: needle-in-a-haystack
selective filter queries where the data is sorted by the filter
column. For example, in the tests I ran with queries against lots of Parquet
files, where the vast majority are pruned by a bunch of small tasks, the query
takes 33 seconds vs. just 1-2 seconds with driver-side pruning using the
summary file (which requires a small Spark change).
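
To make the driver-side pruning idea concrete, here is a minimal sketch. It assumes the summary file has already been parsed into per-row-group min/max values for the filter column; `RowGroupStats` and `DriverSidePruning` are hypothetical stand-ins, not parquet-mr or Spark classes, and the filter is a simple equality predicate on a long column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of the statistics a summary file holds per row group.
final class RowGroupStats {
    final String file;
    final int rowGroup;
    final long min;
    final long max;

    RowGroupStats(String file, int rowGroup, long min, long max) {
        this.file = file;
        this.rowGroup = rowGroup;
        this.min = min;
        this.max = max;
    }
}

final class DriverSidePruning {
    // Keeps only the row groups whose [min, max] range can contain the value.
    static List<RowGroupStats> prune(List<RowGroupStats> all, long value) {
        List<RowGroupStats> kept = new ArrayList<>();
        for (RowGroupStats s : all) {
            if (s.min <= value && value <= s.max) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Data sorted by the filter column, so the ranges do not overlap.
        List<RowGroupStats> stats = new ArrayList<>();
        stats.add(new RowGroupStats("part-0.parquet", 0, 0, 99));
        stats.add(new RowGroupStats("part-1.parquet", 0, 100, 199));
        stats.add(new RowGroupStats("part-2.parquet", 0, 200, 299));

        // Only the surviving row groups would need a task scheduled for them.
        for (RowGroupStats s : prune(stats, 150)) {
            System.out.println(s.file + " row group " + s.rowGroup);
        }
    }
}
```

With sorted data almost every row group is eliminated by this check, which is why doing it once on the driver can beat launching one short-lived task per file just to discover there is nothing to read.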

In our use case we are never going to replace the contents of existing
Parquet files (with a delete and rewrite on HDFS) or append new row
groups to existing files. In that case I don't believe we should
experience any correctness problems, but I wanted to confirm whether there is
something I am missing. I am using
readAllFootersInParallelUsingSummaryFiles, which does fall back to
reading individual footers if they are missing from the summary file.

I am also curious whether a solution to the correctness problems could be to
include a file length and/or last-modified time in the summary file,
which could be checked against FS metadata relatively quickly to confirm that
the files on disk are still in sync with the metadata summary. Would it be
possible to consider avoiding this deprecation if I were to work on an update
to implement this?
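
A rough sketch of the check I have in mind is below. It is only an illustration: it uses `java.nio.file` on a local file system rather than the Hadoop FileSystem API, and `SummaryEntry` and `SummaryFreshness` are hypothetical names, not existing parquet-mr classes or an existing summary-file format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical record of what the summary file would store next to each cached footer.
final class SummaryEntry {
    final Path file;
    final long length;
    final long lastModifiedMillis;

    SummaryEntry(Path file, long length, long lastModifiedMillis) {
        this.file = file;
        this.length = length;
        this.lastModifiedMillis = lastModifiedMillis;
    }
}

final class SummaryFreshness {
    // Captures length and modification time when the summary is written.
    static SummaryEntry record(Path file) throws IOException {
        return new SummaryEntry(file, Files.size(file),
                Files.getLastModifiedTime(file).toMillis());
    }

    // Returns true only if the file still exists with the recorded length and mtime.
    static boolean isFresh(SummaryEntry entry) throws IOException {
        if (!Files.exists(entry.file)) {
            return false;
        }
        return Files.size(entry.file) == entry.length
                && Files.getLastModifiedTime(entry.file).toMillis() == entry.lastModifiedMillis;
    }

    public static void main(String[] args) throws IOException {
        Path data = Files.createTempFile("part-0", ".parquet");
        Files.write(data, new byte[] {1, 2, 3});
        SummaryEntry entry = record(data);
        System.out.println(isFresh(entry)); // nothing changed since the summary was written

        Files.write(data, new byte[] {1, 2, 3, 4}); // simulate a rewrite of the data file
        System.out.println(isFresh(entry)); // length differs, so the cached footer is stale

        Files.delete(data);
    }
}
```

A stale entry would simply fall back to reading that file's own footer, so the check costs one metadata lookup per file instead of a footer read. On HDFS the same two values should be available from `FileStatus.getLen()` and `FileStatus.getModificationTime()`.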

- Jason Altekruse


On Tue, Sep 15, 2020 at 8:52 PM Jason Altekruse 
wrote:

> Hello all,
>
> I have been working on optimizing reads in spark to avoid spinning up lots
> of short lived tasks that just perform row group pruning in selective
> filter queries.
>
> My high level question is why metadata summary files were marked
> deprecated in this Parquet changeset? There isn't much explanation given
> or a description of what should be used instead.
> https://github.com/apache/parquet-mr/pull/429
>
> There are other members of the broader parquet community that are also
> confused by this deprecation, see this discussion in an arrow PR.
> https://github.com/apache/arrow/pull/4166
>
> In the course of making my small prototype I got an extra performance
> boost by making spark write out metadata summary files, rather than having
> to read all footers on the driver. This effect would be even more
> pronounced on a completely remote storage system like S3. Writing these
> summary files was disabled by default in SPARK-15719, because of the
> performance impact of appending a small number of new files to an existing
> dataset with many files.
>
> https://issues.apache.org/jira/browse/SPARK-15719
>
> This Spark JIRA does make decent points considering how Spark operates
> today, but I think that there is a performance optimization opportunity
> that is missed because the row group pruning is deferred to a bunch of
> separate short-lived tasks rather than done upfront; currently Spark only
> uses footers on the driver for schema merging.
>
> Thanks for the help!
> Jason Altekruse
>


[GitHub] [parquet-mr] dossett opened a new pull request #818: Remove brew install since thrift 0.12 isn't available

2020-09-23 Thread GitBox


dossett opened a new pull request #818:
URL: https://github.com/apache/parquet-mr/pull/818


   brew has thrift 0.13 and 0.9 available, but not 0.12, so this command will fail
   




[GitHub] [parquet-mr] SinghAsDev commented on pull request #222: Parquet-313: Implement 3 level list writing rule for Parquet-Thrift

2020-09-23 Thread GitBox


SinghAsDev commented on pull request #222:
URL: https://github.com/apache/parquet-mr/pull/222#issuecomment-697699892


   > @ttim and @tlazaro, do you mind taking a look and letting us know if this 
looks ok to you? We'd want to get this merged.
   
   Hey @ttim and @tlazaro, we have been using this at Pinterest for a few years 
without problems, and would like to have it merged so that we can get rid of our 
internal fork. Also, this is config-controlled, so the default behavior remains 
unchanged. We would really appreciate it if you could help review this from the 
Twitter side, so that we can go ahead with merging. I can rebase the patch right 
after.


