RE: Date and time for next parquet sync

2019-01-17 Thread Santlal J Gupta
Hi team,

This email ID will not be available from this Friday onwards. Please add my
other email ID (santlal561...@gmail.com) to the Parquet sync meeting invite.

I have already sent a meeting request from that email ID
(santlal561...@gmail.com).

Thanks
Santlal J Gupta

-Original Message-
From: Santlal J Gupta 
Sent: Friday, September 29, 2017 10:19 AM
To: dev@parquet.apache.org
Subject: RE: Date and time for next parquet sync

Yes, I want to join.

-Original Message-
From: Lars Volker [mailto:l...@cloudera.com] 
Sent: Thursday, September 28, 2017 8:40 PM
To: dev@parquet.apache.org
Subject: Date and time for next parquet sync

I sent out a meeting request for the next Parquet sync on Wednesday, October
11th at 9am PST. Please reply to this email if you'd like to join and find
yourself not on the invite yet.


[Discussion] How to build bloom filter in parquet

2019-01-17 Thread 俊杰陈
Hi Parquet Developers

In the bloom filter design doc we have discussed and agreed on the bloom
filter definition; now I'd like to invite you to discuss how to build a
bloom filter in Parquet.

In my current implementation, a bloom filter is first created according to a
specified number of distinct values (NDV) and false positive probability
(FPP), and is then updated as the column writer writes values. This approach
requires the user to estimate the column's NDV within a row group, which is
usually hard for end users, especially since they don't have the row group
size information. As a result, the created bloom filter may neither match the
expected FPP nor fit within the size requirements. Although I could provide
extra parameters such as a maximum bloom filter size to avoid wasting space,
I think this can still be improved.
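
For reference, a minimal sketch of the standard sizing math involved
(illustrative names only, not the actual implementation):

    // Standard bloom filter sizing: bits m = -n*ln(p)/(ln 2)^2 and
    // hash count k = (m/n)*ln 2. Illustrative names, not the parquet-mr API.
    public final class BloomFilterSizing {

      static long optimalNumOfBits(long ndv, double fpp) {
        return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
      }

      static int optimalNumOfHashFunctions(long ndv, long numBits) {
        return Math.max(1, (int) Math.round((double) numBits / ndv * Math.log(2)));
      }

      public static void main(String[] args) {
        long bits = optimalNumOfBits(1_000_000, 0.01);   // ~1.14 MiB of bits
        System.out.println(bits / 8 + " bytes, "
            + optimalNumOfHashFunctions(1_000_000, bits) + " hash functions");
      }
    }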

So I think the following three things need to be discussed first.

1. What parameters/configurations should we present to the end user?
In my mind, a better way is for users to specify the column names and the
maximum sizes they are willing to spend on the bloom filter, and for Parquet
to take responsibility for calculating the NDV and creating the bloom filter.

2. How to calculate the NDV at run time?
I tried allocating a set to store all hash values for a column chunk and then
updating the bloom filter bitset at once when flushing the row group (see the
sketch after this list). I am not sure whether this could cause memory issues.

3. When to update the bloom filter?
When writing values in the column writer, or when flushing the row group? If
we use the set to store distinct hash values, we can update it when flushing
the row group.
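
A rough sketch of what option 2 could look like, assuming a
Kirsch-Mitzenmacher style double-hashing scheme (all names are illustrative,
not the actual implementation):

    import java.util.HashSet;
    import java.util.Set;

    // Buffer distinct hash values per column chunk, then size and fill the
    // bloom filter at row-group flush, once the exact NDV is known.
    public final class DeferredBloomBuilder {
      // This set is the memory concern mentioned above: one boxed Long per
      // distinct value in the column chunk.
      private final Set<Long> distinctHashes = new HashSet<>();
      private final double fpp;

      DeferredBloomBuilder(double fpp) { this.fpp = fpp; }

      // Called by the column writer for every value's hash.
      void add(long hash) { distinctHashes.add(hash); }

      // Called when the row group is flushed.
      long[] buildBitset() {
        long n = Math.max(1, distinctHashes.size());
        int bits = (int) Math.ceil(-n * Math.log(fpp) / (Math.log(2) * Math.log(2)));
        int k = Math.max(1, (int) Math.round((double) bits / n * Math.log(2)));
        long[] bitset = new long[(bits + 63) / 64];
        for (long h : distinctHashes) {
          for (int i = 0; i < k; i++) { // derive k probe positions from one hash
            int bit = (int) Long.remainderUnsigned(h + i * 0x9E3779B97F4A7C15L, bits);
            bitset[bit >>> 6] |= 1L << (bit & 63);
          }
        }
        distinctHashes.clear();
        return bitset;
      }
    }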

There are surely more things to consider beyond these three. I would really
appreciate any opinions, or other issues you think should be raised.

-- 
Thanks & Best Regards


Re: [Discussion] How to build bloom filter in parquet

2019-01-17 Thread Gabor Szadovszky
Thanks for raising this, Junjie.

One more topic worth adding:
Which columns do we want to write bloom filters for? Might it depend on the
type? Is a bloom filter needed if we have a dictionary? Is a bloom filter
needed if the column is ordered and we have column indexes? (etc.)





[jira] [Commented] (PARQUET-1328) [java]Bloom filter read/write implementation

2019-01-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745159#comment-16745159
 ] 

ASF GitHub Bot commented on PARQUET-1328:
-

cjjnjust commented on pull request #587: PARQUET-1328: Add Bloom filter reader 
and writer
URL: https://github.com/apache/parquet-mr/pull/587

The original pull request is based on master. This one is created against the
bloom-filter branch.



> [java]Bloom filter read/write implementation
> 
>
> Key: PARQUET-1328
> URL: https://issues.apache.org/jira/browse/PARQUET-1328
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>






Re: [Discussion] How to build bloom filter in parquet

2019-01-17 Thread Zoltan Ivanfi
Hi,

I like the idea of specifying the maximum acceptable size of the bloom
filter bit vector. I think it would be much better than specifying the
expected number of distinct values (which, in my opinion, we cannot expect
from the API consumer). The desired false positive probability could still be
specified, and as a last step before writing the bloom filter to the file it
could be checked, dropping the filter if it does not fulfill the check.
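
A minimal sketch of that last step, using the textbook estimate for the
achieved FPP (illustrative names, not an actual Parquet API):

    // Expected FPP of a filter with m bits and k hash functions after n
    // inserts: p = (1 - e^(-k*n/m))^k. Drop the filter if it misses the target.
    public final class FppCheck {

      static double estimatedFpp(long numBits, int numHashes, long numInserted) {
        return Math.pow(
            1.0 - Math.exp(-(double) numHashes * numInserted / numBits), numHashes);
      }

      static boolean shouldKeep(long numBits, int numHashes, long numInserted,
                                double targetFpp) {
        return estimatedFpp(numBits, numHashes, numInserted) <= targetFpp;
      }
    }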

A bit of brainstorming (just some ideas that may or may not be useful): one
more thing to consider is whether some smart encoding of the bit vector would
help save space. I expect the entropy of a nearly empty or nearly full bloom
filter to be relatively low, because it consists mostly of zeroes or ones
(respectively). For example, RLE encoding could take advantage of such a
pattern (but it should only be used when it actually saves space, because
under regular circumstances the entropy will be high and RLE would only
increase the data size).
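
A toy byte-level illustration of that trade-off (not a real Parquet encoding;
a real format would also need a flag recording which representation was kept):

    import java.io.ByteArrayOutputStream;

    // Encode runs of identical bytes as (value, runLength) pairs and keep the
    // raw bitset whenever RLE would be larger.
    public final class RleSketch {

      static byte[] rleEncode(byte[] bitset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < bitset.length) {
          byte v = bitset[i];
          int run = 1;
          while (i + run < bitset.length && bitset[i + run] == v && run < 255) {
            run++;
          }
          out.write(v);   // run value
          out.write(run); // run length, 1..255
          i += run;
        }
        byte[] encoded = out.toByteArray();
        return encoded.length < bitset.length ? encoded : bitset;
      }
    }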

Alternatively, multiple bloom filters could be built at the same time until
it becomes obvious which one matches the data characteristics best. The
downside of this is the increased number of hash calculations. This downside
could be worked around by building only a large bloom filter and "folding
the bit vector onto itself multiple times, cutting at the desired size
boundary". For example, if we build a bloom filter of 512 bits but in the end
we see that 64 bits would have been enough, we can split the bit vector into
8 equal chunks and XOR them together. The resulting bit vector can still
function as a bloom filter by applying a modulo 64 operation to the hashes
during lookup. Its efficiency may be worse, though, than if we had used hash
functions that directly map onto 64 bits.
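
A minimal sketch of the folding idea (illustrative names; both sizes are
assumed to be powers of two and at least 64 bits). Note that the sketch folds
with bitwise OR rather than XOR, since XOR could clear a bit that two chunks
both have set and thereby introduce false negatives:

    // Fold a bitset of fromBits down to toBits; a bit set at position p in the
    // large filter lands at position p % toBits in the folded one.
    public final class BloomFold {

      static long[] fold(long[] bitset, int fromBits, int toBits) {
        long[] folded = new long[toBits / 64];
        int chunks = fromBits / toBits;
        for (int c = 0; c < chunks; c++) {
          for (int w = 0; w < folded.length; w++) {
            folded[w] |= bitset[c * folded.length + w];
          }
        }
        return folded;
      }

      // Lookup applies the hash modulo the folded size.
      static boolean mightContain(long[] folded, int numBits, long hash) {
        int bit = (int) Long.remainderUnsigned(hash, numBits);
        return (folded[bit >>> 6] & (1L << (bit & 63))) != 0;
      }
    }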

Br,

Zoltan



Adding more timestamp types to on-disk storage formats

2019-01-17 Thread Zoltan Ivanfi
Hi,

There is an ongoing effort amongst the SQL engines of the Hadoop stack
to support different timestamp semantics. This development has some
implications for the low-level timestamp types as well. The new
timestamp types added to the different SQL engines will rely on the
decisions of the lower-level components about which timestamp
semantics they support and how.

I have created a document to summarize what this means for on-disk
storage formats in general and I am sending it out to multiple dev
mailing lists to let you know about this new requirement and to
initiate a discussion in affected open source components.

The document can be read here:
https://docs.google.com/document/d/1E-7miCh4qK6Mg54b-Dh5VOyhGX8V4xdMXKIHJL36a9U/edit

Please let me know your thoughts.

Thanks,

Zoltan


Re: [VOTE] Release Apache Parquet 1.11.0 RC3

2019-01-17 Thread Zoltan Ivanfi
Hi,

Friendly reminder to please vote for the release. We need 2 more binding +1
votes.

Thanks,

Zoltan

On Sat, Jan 12, 2019 at 3:07 AM 俊杰陈  wrote:

> +1  (non-binding)
> * contents look good
> * unit tests passed
>
>
> On Friday, January 11, 2019 at 9:31 PM, Zoltan Ivanfi wrote:
>
> > +1 (binding)
> >
> > * contents look good
> > * unit tests pass
> > * checksums match
> > * signature matches
> >
> > Br,
> >
> > Zoltan
> >
> > On Thu, Jan 10, 2019 at 11:48 AM Gabor Szadovszky 
> > wrote:
> >
> > > Hi,
> > >
> > > Checked tarball: checksum/signature are correct. Content is correct
> based
> > > on release tag. Unit tests pass.
> > >
> > > +1 (non-binding)
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Wed, Jan 9, 2019 at 4:51 PM Zoltan Ivanfi 
> > > wrote:
> > >
> > > > Dear Parquet Users and Developers,
> > > >
> > > > I propose the following RC to be released as the official Apache
> > > > Parquet 1.11.0 release:
> > > >
> > > > The commit id is 8be767d12cca295cf9858a521725fc440b0c6f93
> > > > * This corresponds to the tag: apache-parquet-1.11.0
> > > > * https://github.com/apache/parquet-mr/tree/8be767d12cca295cf9858a521725fc440b0c6f93
> > > >
> > > > The release tarball, signature, and checksums are here:
> > > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc3/
> > > >
> > > > You can find the KEYS file here:
> > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > >
> > > > Binary artifacts are staged in Nexus here:
> > > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet/1.11.0/
> > > >
> > > > This release includes the following new features:
> > > >
> > > > - PARQUET-1201 - Column indexes
> > > > - PARQUET-1253 - Support for new logical type representation
> > > > - PARQUET-1381 - Add merge blocks command to parquet-tools
> > > > - PARQUET-1388 - Nanosecond precision time and timestamp - parquet-mr
> > > >
> > > > The release also includes bug fixes, including:
> > > >
> > > > - PARQUET-1472: Dictionary filter fails on FIXED_LEN_BYTE_ARRAY.
> > > >
> > > > Please download, verify, and test. The vote will be open for at least
> > > > 72 hours.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > >
> >
>
>
> --
> Thanks & Best Regards
>


[jira] [Updated] (PARQUET-1396) Cryptodata Interface for Schema Activation of Parquet Encryption

2019-01-17 Thread Xinli Shang (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1396:
-
Summary: Cryptodata Interface for Schema Activation of Parquet Encryption  
(was: Cryptodata Interface for no-API Activation of Parquet Encryption)

> Cryptodata Interface for Schema Activation of Parquet Encryption
> 
>
> Key: PARQUET-1396
> URL: https://issues.apache.org/jira/browse/PARQUET-1396
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.10.1
>
>
> This JIRA is an extension to the Parquet Modular Encryption JIRA
> (PARQUET-1178) that will provide the basic building blocks and APIs for the
> encryption support. This JIRA provides a crypto data interface for non-API
> activation of Parquet encryption and serves as a high-level layer on top of
> PARQUET-1178 to enable fine-grained and flexible column-level access control,
> with a pluggable key access module, without a need to use the low-level
> encryption APIs. Also, this feature will enable seamless integration with
> existing clients.





[jira] [Updated] (PARQUET-1396) Cryptodata Interface for Schema Activation of Parquet Encryption

2019-01-17 Thread Xinli Shang (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1396:
-
Description: 
This JIRA is an extension to the Parquet Modular Encryption JIRA
(PARQUET-1178) that will provide the basic building blocks and APIs for the
encryption support.

This JIRA provides a crypto data interface for schema activation of Parquet
encryption and serves as a high-level layer on top of PARQUET-1178 to make
its adoption easier, with a pluggable key access module, without a need to
use the low-level encryption APIs. Also, this feature will enable seamless
integration with existing clients.

  was:This JIRA is an extension to Parquet Modular Encryption 
Jira(PARQUET-1178) that will provide the basic building blocks and APIs for the 
encryption support. This JIRA provides a crypto data interface for non-API 
activation of Parquet encryption and serves as a high-level layer on top of 
PARQUET-1178 to enable fine-grained and flexible column level access control, 
with pluggable key access module, without a need to use the low-level 
encryption APIs. Also, this feature will enable seamless integration with 
existing clients.


> Cryptodata Interface for Schema Activation of Parquet Encryption
> 
>
> Key: PARQUET-1396
> URL: https://issues.apache.org/jira/browse/PARQUET-1396
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.10.1
>
>
> This JIRA is an extension to the Parquet Modular Encryption JIRA
> (PARQUET-1178) that will provide the basic building blocks and APIs for the
> encryption support.
> This JIRA provides a crypto data interface for schema activation of Parquet
> encryption and serves as a high-level layer on top of PARQUET-1178 to make
> its adoption easier, with a pluggable key access module, without a need to
> use the low-level encryption APIs. Also, this feature will enable seamless
> integration with existing clients.





[jira] [Updated] (PARQUET-1396) Cryptodata Interface for Schema Activation of Parquet Encryption

2019-01-17 Thread Xinli Shang (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1396:
-
Description: 
This JIRA is an extension to the Parquet Modular Encryption JIRA
(PARQUET-1178) that will provide the basic building blocks and APIs for the
encryption support.

This JIRA provides a crypto data interface for schema activation of Parquet
encryption and serves as a high-level layer on top of PARQUET-1178 to make
its adoption easier, with a pluggable key access module, without a need to
use the low-level encryption APIs. Also, this feature will enable seamless
integration with existing clients.

No change to specifications (Parquet-format), no new Parquet APIs, and no
changes to existing Parquet APIs. All current applications, tests, etc., will
work.

From a developer's perspective, they can just implement the interface in a
plugin that can be attached to any Parquet application like Hive/Spark etc.
This decouples the complexity of dealing with KMS and schemas from Parquet
applications. In a large organization there may be hundreds or even thousands
of Parquet applications and pipelines; the decoupling would make Parquet
encryption easier to adopt.

From an end user's (for example, a data owner's) perspective, if they think a
column is sensitive, they can just mark that column's schema as sensitive and
the Parquet application then encrypts that column automatically. This makes
it easy for end users to manage the encryption of their columns.

  was:
This JIRA is an extension to Parquet Modular Encryption Jira(PARQUET-1178) that 
will provide the basic building blocks and APIs for the encryption support. 

This JIRA provides a crypto data interface for schema activation of Parquet 
encryption and serves as a high-level layer on top of PARQUET-1178 to make the 
adoption of Parquet-1178 easier, with pluggable key access module, without a 
need to use the low-level encryption APIs. Also, this feature will enable 
seamless integration with existing clients.


> Cryptodata Interface for Schema Activation of Parquet Encryption
> 
>
> Key: PARQUET-1396
> URL: https://issues.apache.org/jira/browse/PARQUET-1396
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.10.1
>
>
> This JIRA is an extension to the Parquet Modular Encryption JIRA
> (PARQUET-1178) that will provide the basic building blocks and APIs for the
> encryption support.
> This JIRA provides a crypto data interface for schema activation of Parquet
> encryption and serves as a high-level layer on top of PARQUET-1178 to make
> its adoption easier, with a pluggable key access module, without a need to
> use the low-level encryption APIs. Also, this feature will enable seamless
> integration with existing clients.
> No change to specifications (Parquet-format), no new Parquet APIs, and no
> changes to existing Parquet APIs. All current applications, tests, etc.,
> will work.
> From a developer's perspective, they can just implement the interface in a
> plugin that can be attached to any Parquet application like Hive/Spark etc.
> This decouples the complexity of dealing with KMS and schemas from Parquet
> applications. In a large organization there may be hundreds or even
> thousands of Parquet applications and pipelines; the decoupling would make
> Parquet encryption easier to adopt.
> From an end user's (for example, a data owner's) perspective, if they think
> a column is sensitive, they can just mark that column's schema as sensitive
> and the Parquet application then encrypts that column automatically. This
> makes it easy for end users to manage the encryption of their columns.





Please review Parquet-1396 - Crypto Interface for Schema Activation of Parquet Encryption

2019-01-17 Thread Xinli shang
Dear all,

As PARQUET-1178 passed the vote, I would like to bring PARQUET-1396 (Crypto
Interface for Schema Activation of Parquet Encryption) up for a design
review. The motivation of PARQUET-1396 is to make PARQUET-1178 easier to
integrate into existing applications and services. We will provide a sample
implementation of this interface. The summaries of PARQUET-1396 are below.
For details, please have a look at the design document, "Crypto Interface for
Schema Activation of Parquet Encryption".


Problem statement

PARQUET-1178 provided column encryption inside the Parquet file, but to adopt
it, Parquet applications like Spark/Hive need to change their code by 1)
calling the new crypto API; 2) determining column sensitivity and building the
corresponding column encryption properties; and 3) handling key management,
which most likely involves interacting with a remote KMS (Key Management
Service). In reality, especially in a large organization, the Parquet library
is used by many different applications and teams, so making a relatively
significant code change to every application could make adoption of the
encryption feature harder and slower. In such cases, people usually prefer
minimal changes, such as upgrading the Parquet library and changing
configuration to enable the feature.

Some organizations have centralized schema storage where developers can
define their table schemas. To adopt Parquet modular encryption, it would be
a natural choice for those developers to just change their schema in the
centralized schema storage, for example by appending a boolean flag to a
column to enable encryption for that column. The technical stack, including
ingestion pipelines, compute engines, etc., would then use that schema to
control column encryption inside Parquet files, without any further user
involvement. Even where there is no centralized schema storage, it may still
be desirable to control column encryption by just changing a setting in the
schema, because it eases encryption management.

Goals

1. Provide an interface for activating the Parquet encryption proposed by
   PARQUET-1178 via the passed-in schema, wrapping up key management and
   several other crypto settings into a plugin that implements this
   interface, in order to ease the adoption of Parquet modular encryption.

2. No change to specifications (Parquet-format), no new Parquet APIs, and no
   changes to existing Parquet APIs. All current applications, tests, etc.,
   will work.

3. If no plugin (an implementation of the proposed interface) is configured
   in the Hadoop configuration, all the changes discussed in this design are
   bypassed and all existing behavior works as before.

Technical approach

We propose a module, the Crypto Properties Retriever, which parses the schema
passed in from the Parquet application, retrieves keys, key metadata, and the
AAD (Additional Authentication Data) prefix from an external service, and
manages several other encryption settings. The encryption/decryption module
added by PARQUET-1178 can get the needed encryption information from this
retriever to do the encryption and decryption. This module will be released
as a Parquet plugin, which can be enabled via the Hadoop configuration. This
proposal defines an interface that contracts the implementation of this
plugin. (A diagram in the design document shows the relation between
PARQUET-1396 and PARQUET-1178.)
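
As a rough illustration (hypothetical names and signatures only, not the
actual proposed API), the retriever contract could look something like this:

    import java.util.Map;

    // Hypothetical sketch of the Crypto Properties Retriever contract
    // described above; not the actual Parquet interface.
    public interface CryptoPropertiesRetriever {

      // Inspect the application-provided schema (e.g., field metadata marking
      // a column as sensitive) and decide whether the column must be encrypted.
      boolean isSensitive(String columnPath, Map<String, String> schemaMetadata);

      // Fetch the data encryption key for a column, typically from a remote KMS.
      byte[] getColumnKey(String columnPath);

      // Key metadata stored in the file so that readers can locate the key.
      byte[] getColumnKeyMetadata(String columnPath);

      // AAD (Additional Authentication Data) prefix for the file, if any.
      byte[] getAadPrefix(String filePath);
    }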




From a developer's perspective, they can just implement the interface in a
plugin that can be attached to any Parquet application like Hive/Spark etc.
This decouples the complexity of dealing with KMS and schemas from Parquet
applications. In a large organization there may be hundreds or even thousands
of Parquet applications and pipelines; the decoupling would make Parquet
encryption easier to adopt.

From an end user's (for example, a data owner's) perspective, if they think a
column is sensitive, they can just mark that column's schema as sensitive and
the Parquet application then encrypts that column automatically. This makes
it easy for end users to manage the encryption of their columns.


Thanks in advance for your time! Looking forward to your feedback!

--

Xinli Shang

Uber Big Data Team