[GitHub] [parquet-mr] garawalid commented on a change in pull request #781: PARQUET-1826: Document Hadoop configuration options

2020-04-21 Thread GitBox


garawalid commented on a change in pull request #781:
URL: https://github.com/apache/parquet-mr/pull/781#discussion_r412497639



##
File path: parquet-hadoop/README.md
##
@@ -230,23 +236,28 @@ conf.set("parquet.bloom.filter.expected.ndv#column.path", 200)
 ## Class: ParquetInputFormat
 
 **Property:** `parquet.read.support.class`  
-**Description:** The read support class.
+**Description:** The read support class that is used in
+ParquetInputFormat to materialize records. It should be a the descendant class of `org.apache.parquet.hadoop.api.ReadSupport`
 
 ---
 
 **Property:** `parquet.read.filter`  
-**Description:** **Todo**
+**Description:** The filter class name that implements `org.apache.parquet.filter.UnboundRecordFilter`. This class is for the old filter API in the package `org.apache.parquet.filter`, it filters records during record assembly.
 
 ---
 
-**Property:** `parquet.strict.typing`  
-**Description:** Whether to enable type checking for conflicting schema.  
-**Default value:** `true`
+ **Property:** `parquet.private.read.filter.predicate`  
+ **Description:** The filter class used in the new filter API in the package `org.apache.parquet.filter2.predicate`
+ Note that this class should implements `org.apache.parquet.filter2..FilterPredicate` and the value of this property should be a gzip compressed base64 encoded java serialized object.  

Review comment:
   I think it's okay if we keep the details of the object. After all, we 
will suggest using the `setFilterPredicate` method.
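
To make the phrase "gzip compressed base64 encoded java serialized object" concrete, here is a minimal JDK-only sketch of that encoding and its inverse. The class name `PredicateEncodingSketch` and the `String` payload are hypothetical stand-ins for a real predicate object, and this is not the parquet-mr helper itself; it only illustrates why users are better off calling `setFilterPredicate` than writing the property value by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of parquet-mr.
public class PredicateEncodingSketch {

    // Java-serialize the object, gzip the bytes, then base64-encode them.
    static String encode(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(obj);
        }
        // The streams are closed here, so the gzip trailer has been written.
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverse the steps: base64-decode, gunzip, then deserialize.
    static Object decode(String value) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(value);
        try (ObjectInputStream in =
                 new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(raw)))) {
            return in.readObject();
        }
    }
}
```

The resulting string is opaque and not something a user could reasonably type into a configuration file, which supports pointing readers at the setter method instead.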





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1826) Document hadoop configuration options

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089079#comment-17089079
 ] 

ASF GitHub Bot commented on PARQUET-1826:
-



> Document hadoop configuration options
> -------------------------------------
>
> Key: PARQUET-1826
> URL: https://issues.apache.org/jira/browse/PARQUET-1826
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Gabor Szadovszky
> Assignee: Walid Gara
> Priority: Major
> Labels: pull-request-available
>
> The currently available Hadoop configuration options are not documented
> properly. The only documentation we have is the javadoc comment and the
> implementation of
> [ParquetOutputFormat|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java].
> We shall investigate all the possible options and their usage/default values
> and document them properly so that they are easily accessible to our users.
> I would suggest creating a `README.md` file in the sub-module
> [parquet-hadoop|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop]
> that would describe the purpose of the module and have a section that
> lists the possible Hadoop configuration options. (Later on we shall extend
> this document with other descriptions of the purpose and usage of our
> library in the Hadoop ecosystem. These efforts shall be covered by other
> jiras.)
> By adding the description to the source code, it would be easy to extend it
> with the new features we implement, so it will be up to date for every release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] garawalid commented on a change in pull request #781: PARQUET-1826: Document Hadoop configuration options

2020-04-21 Thread GitBox


garawalid commented on a change in pull request #781:
URL: https://github.com/apache/parquet-mr/pull/781#discussion_r412496194



##
File path: parquet-hadoop/README.md
##
@@ -158,7 +164,7 @@ This property should be between 0 and 1.
 ---
 
 **Property:** `parquet.page.size.row.check.max`  
-**Description:** The maximum number of rows per page.  
+**Description:** The frequency of checks of the page size limit. In other words, we perform the checking after each `parquet.page.size.row.check.max` rows.  

Review comment:
   Thanks @gszadovszky for the clarification, I'll duplicate the 
description for both `parquet.page.size.row.check.min` and 
`parquet.page.size.row.check.max` properties.
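
As a rough illustration of what "checking after each N rows" means, the sketch below models a writer that only compares the buffered size against the page size limit every `checkEveryRows` rows. Everything here is hypothetical (the class `PageSizeCheckSketch`, the fixed check interval, the per-row byte sizes); it is a simplified model of the behaviour described in the diff, not the parquet-mr writer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; not the parquet-mr implementation.
public class PageSizeCheckSketch {

    // Returns the 1-based row counts after which a page would be flushed.
    static List<Integer> flushPoints(int[] rowSizes, int checkEveryRows, long pageSizeLimit) {
        List<Integer> flushes = new ArrayList<>();
        long bufferedBytes = 0;
        int rowsSinceCheck = 0;
        for (int i = 0; i < rowSizes.length; i++) {
            bufferedBytes += rowSizes[i];
            rowsSinceCheck++;
            // The size limit is only consulted every checkEveryRows rows.
            if (rowsSinceCheck == checkEveryRows) {
                rowsSinceCheck = 0;
                if (bufferedBytes >= pageSizeLimit) {
                    flushes.add(i + 1);
                    bufferedBytes = 0;
                }
            }
        }
        return flushes;
    }
}
```

With ten rows of 100 bytes each, a check every 4 rows, and a 250-byte limit, this model flushes after rows 4 and 8, so each page holds 400 bytes. The property therefore controls how often the limit is checked, not how many rows a page may contain.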









[jira] [Commented] (PARQUET-1826) Document hadoop configuration options

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089078#comment-17089078
 ] 

ASF GitHub Bot commented on PARQUET-1826:
-



[GitHub] [parquet-mr] garawalid commented on a change in pull request #781: PARQUET-1826: Document Hadoop configuration options

2020-04-21 Thread GitBox


garawalid commented on a change in pull request #781:
URL: https://github.com/apache/parquet-mr/pull/781#discussion_r412495171



##
File path: parquet-hadoop/README.md
##
@@ -230,23 +236,28 @@ conf.set("parquet.bloom.filter.expected.ndv#column.path", 200)
 ## Class: ParquetInputFormat
 
 **Property:** `parquet.read.support.class`  
-**Description:** The read support class.
+**Description:** The read support class that is used in
+ParquetInputFormat to materialize records. It should be a the descendant class of `org.apache.parquet.hadoop.api.ReadSupport`
 
 ---
 
 **Property:** `parquet.read.filter`  
-**Description:** **Todo**
+**Description:** The filter class name that implements `org.apache.parquet.filter.UnboundRecordFilter`. This class is for the old filter API in the package `org.apache.parquet.filter`, it filters records during record assembly.
 
 ---
 
-**Property:** `parquet.strict.typing`  
-**Description:** Whether to enable type checking for conflicting schema.  
-**Default value:** `true`
+ **Property:** `parquet.private.read.filter.predicate`  
+ **Description:** The filter class used in the new filter API in the package `org.apache.parquet.filter2.predicate`
+ Note that this class should implements `org.apache.parquet.filter2..FilterPredicate` and the value of this property should be a gzip compressed base64 encoded java serialized object.  
+ The new filter API can filter records or filter entire row groups of records without reading them at all.
+
+**Note:** User should either use the old filter API (`parquet.read.filter`) or the new one (`parquet.private.read.filter.predicate`).

Review comment:
   I agree! 









[jira] [Commented] (PARQUET-1826) Document hadoop configuration options

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089075#comment-17089075
 ] 

ASF GitHub Bot commented on PARQUET-1826:
-



[jira] [Commented] (PARQUET-1844) Removed Hadoop transitive dependency on commons-lang

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088702#comment-17088702
 ] 

ASF GitHub Bot commented on PARQUET-1844:
-

gszadovszky commented on issue #787:
URL: https://github.com/apache/parquet-mr/pull/787#issuecomment-617188187


   @shangxinli, could you please take a look when you have some time? This is
required for Hadoop 3.3.





> Removed Hadoop transitive dependency on commons-lang
> ----------------------------------------------------
>
> Key: PARQUET-1844
> URL: https://issues.apache.org/jira/browse/PARQUET-1844
> Project: Parquet
> Issue Type: Task
> Reporter: Gabor Szadovszky
> Priority: Major
>
> Some parts of our code use commons-lang without declaring a direct
> dependency on it. It comes in as a transitive dependency from Hadoop. As of
> Hadoop 3.3, Hadoop has migrated from commons-lang to commons-lang3, which
> breaks the build when parquet-mr is built against it.
> We shall either properly declare our direct dependency on commons-lang
> (possibly migrating to commons-lang3 as well) or refactor the code so that it
> does not use commons-lang at all.





[GitHub] [parquet-mr] gszadovszky commented on issue #787: PARQUET-1844: Eliminate using commons-lang

2020-04-21 Thread GitBox






[jira] [Commented] (PARQUET-1847) Filter out github notification from dev mail list

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088672#comment-17088672
 ] 

ASF GitHub Bot commented on PARQUET-1847:
-

chenjunjiedada commented on issue #788:
URL: https://github.com/apache/parquet-mr/pull/788#issuecomment-617173203


   @wesm, I guess you are familiar with the infra setup. Could you please help
list the options and start a vote thread for this?





> Filter out github notification from dev mail list
> -------------------------------------------------
>
> Key: PARQUET-1847
> URL: https://issues.apache.org/jira/browse/PARQUET-1847
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: Junjie Chen
> Priority: Minor
>






[GitHub] [parquet-mr] chenjunjiedada commented on issue #788: PARQUET-1847: Filter out github notification from dev mail list

2020-04-21 Thread GitBox






[jira] [Commented] (PARQUET-1847) Filter out github notification from dev mail list

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088668#comment-17088668
 ] 

ASF GitHub Bot commented on PARQUET-1847:
-

chenjunjiedada edited a comment on issue #788:
URL: https://github.com/apache/parquet-mr/pull/788#issuecomment-617169461


   Got it. Maybe we should list solutions and start a vote on dev.





[GitHub] [parquet-mr] chenjunjiedada edited a comment on issue #788: PARQUET-1847: Filter out github notification from dev mail list

2020-04-21 Thread GitBox






[GitHub] [parquet-mr] chenjunjiedada commented on issue #788: PARQUET-1847: Filter out github notification from dev mail list

2020-04-21 Thread GitBox


chenjunjiedada commented on issue #788:
URL: https://github.com/apache/parquet-mr/pull/788#issuecomment-617169461


   Got it. Maybe we need a vote to choose solutions.







[jira] [Commented] (PARQUET-1847) Filter out github notification from dev mail list

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088667#comment-17088667
 ] 

ASF GitHub Bot commented on PARQUET-1847:
-



[jira] [Commented] (PARQUET-1847) Filter out github notification from dev mail list

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088657#comment-17088657
 ] 

ASF GitHub Bot commented on PARQUET-1847:
-

wesm commented on issue #788:
URL: https://github.com/apache/parquet-mr/pull/788#issuecomment-617166374


   github@ doesn't exist for parquet.a.o so either this traffic should be 
directed to another mailing list, or a new one should be created. 
   
   Pausing for a second, we should probably confirm on dev@ where people want 
these e-mails to go





[GitHub] [parquet-mr] wesm commented on issue #788: PARQUET-1847: Filter out github notification from dev mail list

2020-04-21 Thread GitBox






[jira] [Commented] (PARQUET-1847) Filter out github notification from dev mail list

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088654#comment-17088654
 ] 

ASF GitHub Bot commented on PARQUET-1847:
-

chenjunjiedada commented on issue #788:
URL: https://github.com/apache/parquet-mr/pull/788#issuecomment-617163785


   @wesm, could you please take a look?





[GitHub] [parquet-mr] chenjunjiedada commented on issue #788: PARQUET-1847: Filter out github notification from dev mail list

2020-04-21 Thread GitBox






Re: Filtering GitBox e-mails out of dev@?

2020-04-21 Thread Junjie Chen
I just opened a Jira and copied the .asf.yaml in a PR:
https://github.com/apache/parquet-mr/pull/788.

On Tue, Apr 21, 2020 at 8:28 PM Wes McKinney  wrote:

> hi,
>
> Would someone please take a look at this?
>
> Thanks
>
> On Mon, Apr 20, 2020 at 8:08 AM Wes McKinney  wrote:
> >
> > Infra made some changes to ensure that GitHub notifications are
> > archived, but that has resulted in new e-mails being sent to dev@
> >
> > In Arrow, we didn't want these so we have
> >
> > * https://issues.apache.org/jira/browse/INFRA-20149
> > * https://issues.apache.org/jira/browse/ARROW-8520
> > * Final solution:
> >
> https://github.com/apache/arrow/commit/aa55967e6b9cf6fc8b4d2f6ac9ec75f8c28c80f5
> >
> > You may want to implement the same thing for apache/parquet-mr
> >
> > - Wes
>


-- 
Best Regards


[GitHub] [parquet-mr] chenjunjiedada opened a new pull request #788: PARQUET-1847: Filter out github notification from dev mail list

2020-04-21 Thread GitBox


chenjunjiedada opened a new pull request #788:
URL: https://github.com/apache/parquet-mr/pull/788


   
   







[jira] [Commented] (PARQUET-1847) Filter out github notification from dev mail list

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088653#comment-17088653
 ] 

ASF GitHub Bot commented on PARQUET-1847:
-



[jira] [Created] (PARQUET-1847) Filter out github notification from dev mail list

2020-04-21 Thread Junjie Chen (Jira)
Junjie Chen created PARQUET-1847:


 Summary: Filter out github notification from dev mail list
 Key: PARQUET-1847
 URL: https://issues.apache.org/jira/browse/PARQUET-1847
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Junjie Chen








Re: Filtering GitBox e-mails out of dev@?

2020-04-21 Thread Wes McKinney
hi,

Would someone please take a look at this?

Thanks

On Mon, Apr 20, 2020 at 8:08 AM Wes McKinney  wrote:
>
> Infra made some changes to ensure that GitHub notifications are
> archived, but that has resulted in new e-mails being sent to dev@
>
> In Arrow, we didn't want these so we have
>
> * https://issues.apache.org/jira/browse/INFRA-20149
> * https://issues.apache.org/jira/browse/ARROW-8520
> * Final solution:
> https://github.com/apache/arrow/commit/aa55967e6b9cf6fc8b4d2f6ac9ec75f8c28c80f5
>
> You may want to implement the same thing for apache/parquet-mr
>
> - Wes


Re: Parquet - 41

2020-04-21 Thread Lekshmi Narayanan, Arun Balajiee
Yes, I would like to contribute to bloom filters in Arrow.

I also wanted to check: would it be a good idea to add Bloom filters to Column
Indices (PARQUET-1404)?
Regards
Arun Balajiee


From: Junjie Chen 
Sent: 20 April 2020 22:20
To: dev@parquet.apache.org 
Subject: Re: Parquet - 41

As far as I know, it is not implemented yet. The Thrift definition is up to
date now; would you like to contribute?

Things we need are:
1. xxhash c++ implementation
2. reader and writer for the bloom filter
3. filtering logic for row group

Implementing the reader would be a good start.
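
For readers new to the idea, the row-group filtering step in item 3 can be sketched with a toy bloom filter. This is only an illustration in Java: the class `RowGroupBloomSketch`, its sizing parameters, and the `hashCode`-based double hashing are all assumptions made for brevity. It does not use xxHash and does not follow the Parquet bloom filter format.

```java
import java.util.BitSet;

// Toy bloom filter for illustration only; not the Parquet bloom filter format.
public class RowGroupBloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RowGroupBloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two hashes (simple double hashing).
    private int index(Object value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step so probes differ
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Writer side: record every value stored in the row group's column chunk.
    void add(Object value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    // Reader side: false means the row group definitely lacks the value and
    // can be skipped; true means it may contain the value and must be read.
    boolean mightContain(Object value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A reader would evaluate `mightContain` for an equality predicate against each row group's filter and skip the row groups that return false. False positives are possible, false negatives are not.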

On Tue, Apr 21, 2020 at 8:52 AM  wrote:

> Hi
>
> Is the C++ version of bloom filter implemented in Arrow Parquet C++?
>
> https://issues.apache.org/jira/browse/PARQUET-41
> [PARQUET-41] Add bloom filters to parquet statistics - ASF JIRA
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups. Pull request:
> https://github.com/ ...
> Regards
>


--
Best Regards


[jira] [Commented] (PARQUET-1844) Removed Hadoop transitive dependency on commons-lang

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088534#comment-17088534
 ] 

ASF GitHub Bot commented on PARQUET-1844:
-

gszadovszky opened a new pull request #787:
URL: https://github.com/apache/parquet-mr/pull/787


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Removed Hadoop transitive dependency on commons-lang
> 
>
> Key: PARQUET-1844
> URL: https://issues.apache.org/jira/browse/PARQUET-1844
> Project: Parquet
> Issue Type: Task
> Reporter: Gabor Szadovszky
> Priority: Major
>
> Some parts of our code use commons-lang without declaring a direct
> dependency on it; it arrives as a transitive dependency from Hadoop. As of
> Hadoop 3.3, Hadoop migrated from commons-lang to commons-lang3, which breaks
> the build when parquet-mr is built against it.
> We should either properly declare a direct dependency on commons-lang
> (possibly migrating to commons-lang3 as well) or refactor the code so that
> it does not use commons-lang at all.
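
For the second option in the description (not using commons-lang at all), a rough sketch of what JDK-only replacements could look like is given below. This is an assumption for illustration, not code from PR #787: the class name `JdkStringHelpers` is hypothetical, and the issue does not say which commons-lang calls parquet-mr actually used. `StringUtils.isBlank` and `StringUtils.join` are used here only as familiar examples.

```java
import java.util.List;

/** Hypothetical helper; the names are illustrative and not taken from PR #787. */
final class JdkStringHelpers {
    private JdkStringHelpers() {}

    // JDK-only stand-in for commons-lang StringUtils.isBlank(String):
    // true for null, the empty string, or whitespace-only input.
    static boolean isBlank(String s) {
        if (s == null) {
            return true;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // JDK-only stand-in for StringUtils.join(Collection, String) via String.join (Java 8+).
    // Note: String.join renders null elements as "null", which differs from commons-lang,
    // so this is only a drop-in replacement when the list contains no nulls.
    static String join(List<String> parts, String separator) {
        return String.join(separator, parts);
    }
}
```

Call sites would then switch from `StringUtils.isBlank(x)` to `JdkStringHelpers.isBlank(x)`, and the build would no longer depend on whichever commons-lang version Hadoop happens to pull in.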



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] gszadovszky opened a new pull request #787: PARQUET-1844: Eliminate using commons-lang

2020-04-21 Thread GitBox


gszadovszky opened a new pull request #787:
URL: https://github.com/apache/parquet-mr/pull/787






[jira] [Commented] (PARQUET-1229) parquet-mr code changes for encryption support

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088391#comment-17088391
 ] 

ASF GitHub Bot commented on PARQUET-1229:
-

ggershinsky commented on issue #776:
URL: https://github.com/apache/parquet-mr/pull/776#issuecomment-617009069


   > can you squash it to one single commit to make the review easier?
   
   This can be reviewed as a single commit at
   https://github.com/apache/parquet-mr/pull/776/files
   





> parquet-mr code changes for encryption support
> --
>
> Key: PARQUET-1229
> URL: https://issues.apache.org/jira/browse/PARQUET-1229
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-mr
> Reporter: Gidon Gershinsky
> Assignee: Gidon Gershinsky
> Priority: Major
> Labels: pull-request-available
>
> Addition of encryption/decryption support to the existing Parquet classes and 
> APIs





[GitHub] [parquet-mr] ggershinsky commented on issue #776: PARQUET-1229: Parquet MR encryption

2020-04-21 Thread GitBox


ggershinsky commented on issue #776:
URL: https://github.com/apache/parquet-mr/pull/776#issuecomment-617009069






[jira] [Commented] (PARQUET-1229) parquet-mr code changes for encryption support

2020-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088389#comment-17088389
 ] 

ASF GitHub Bot commented on PARQUET-1229:
-

ggershinsky commented on issue #776:
URL: https://github.com/apache/parquet-mr/pull/776#issuecomment-617008320


   Preferably reviewed after the Travis fix is in. @gszadovszky @shangxinli, can 
you apply #777 to the encryption branch?







[GitHub] [parquet-mr] ggershinsky commented on issue #776: PARQUET-1229: Parquet MR encryption

2020-04-21 Thread GitBox


ggershinsky commented on issue #776:
URL: https://github.com/apache/parquet-mr/pull/776#issuecomment-617008320

