[jira] [Commented] (PARQUET-1381) Add merge blocks command to parquet-tools
[ https://issues.apache.org/jira/browse/PARQUET-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767786#comment-17767786 ]

ASF GitHub Bot commented on PARQUET-1381:
-----------------------------------------

shangxinli commented on PR #1121:
URL: https://github.com/apache/parquet-mr/pull/1121#issuecomment-1730734276

   This is a great initiative. Do you still have a plan to address the feedback @MaheshGPai ?

> Add merge blocks command to parquet-tools
> -----------------------------------------
>
>                 Key: PARQUET-1381
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1381
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Ekaterina Galieva
>            Assignee: Ekaterina Galieva
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation of the merge command in parquet-tools doesn't merge row
> groups; it just places one after the other. Add an API and a command option to
> merge small blocks into larger ones up to a specified size limit.
> h6. Implementation details:
> Blocks are not reordered, so as not to break possible initial predicate pushdown
> optimizations.
> Blocks are not divided to fit the upper bound perfectly.
> This is an intentional performance optimization.
> This gives an opportunity to form new blocks by copying the full content of
> smaller blocks by column, not by row.
> h6. Examples:
> # Input files with block sizes:
> {code:java}
> [128 | 35], [128 | 40], [120]{code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b -l 256}}
> {code:java}
> [163 | 168 | 120]
> {code}
> # Input files with block sizes:
> {code:java}
> [128 | 35], [40], [120], [6]{code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 40 | 120 | 6]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 75 | 126]
> {code}
> {{merge -b -l 256}}
> {code:java}
> [203 | 126]{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [parquet-mr] shangxinli commented on pull request #1121: PARQUET-1381: Support merging of rowgroups during file rewrite
shangxinli commented on PR #1121:
URL: https://github.com/apache/parquet-mr/pull/1121#issuecomment-1730734276

   This is a great initiative. Do you still have a plan to address the feedback @MaheshGPai ?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
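The expected outputs in the issue description above follow a simple greedy rule: scan the blocks in their original order (never reordering or splitting them) and close a merged block whenever adding the next input block would exceed the size limit. A rough sketch of that rule, not the actual parquet-mr implementation (class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockMergeSketch {
  /** Greedily groups block sizes, in order, into merged blocks of at most `limit`. */
  static List<Long> mergeSizes(List<Long> blocks, long limit) {
    List<Long> merged = new ArrayList<>();
    long current = 0;
    for (long b : blocks) {
      if (current > 0 && current + b > limit) {
        merged.add(current); // close the current group before it would overflow
        current = 0;
      }
      current += b; // a single oversized block still forms its own group
    }
    if (current > 0) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    // Example 1 from the issue, merge -b -l 256 on [128 | 35], [128 | 40], [120]:
    System.out.println(mergeSizes(List.of(128L, 35L, 128L, 40L, 120L), 256)); // [163, 168, 120]
    // Example 2, merge -b -l 256 on [128 | 35], [40], [120], [6]:
    System.out.println(mergeSizes(List.of(128L, 35L, 40L, 120L, 6L), 256));   // [203, 126]
  }
}
```

With the default 128 limit the same rule reproduces the `merge -b` outputs as well, e.g. [128 | 35 | 128 | 40 | 120] for the first example.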
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767784#comment-17767784 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

mapleFU commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333823691

##########
src/main/thrift/parquet.thrift:
##########
@@ -216,7 +216,12 @@ struct Statistics {
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
    /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * lower and upper bound values for the column, determined by its ColumnOrder.

Review Comment:
   Might not be directly related to this question: if the column is a utf-8 column (with LogicalType `UTF8`), can the binary here be non-utf8 bytes? E.g. `s[...100]` is valid utf-8, but `s[...50]` is not.

> Update parquet format spec to allow truncation of row group min/max stats
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-2352
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2352
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Raunaq Morarka
>            Priority: Major
>
> Column index stats are explicitly allowed to be truncated
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L958]
> However, it seems row group min/max stats are not allowed to be truncated
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L219]
> although this is not explicitly clarified as in the column index case. This
> forces implementations to either drop min/max row group stats for columns
> with long strings and miss opportunities for filtering row groups, or
> seemingly deviate from spec by truncating min/max row group stats.
> https://issues.apache.org/jira/browse/PARQUET-1685 introduced a feature to
> parquet-mr which allows users to deviate from spec and configure truncation
> of min/max row group stats. Unfortunately, there is no way for readers to
> detect whether truncation took place.
> Since the possibility of truncation exists and is not possible to explicitly
> detect, attempts to push down min/max aggregation to parquet have avoided
> implementing it for string columns (e.g.
> https://issues.apache.org/jira/browse/SPARK-36645)
> Given the above situation, the spec should be updated to allow truncation of
> min/max row group stats. This would align the spec with the current reality
> that string column min/max row group stats could be truncated.
> Additionally, a flag could be added to the stats to specify whether min/max
> stats are truncated. Reader implementations could then safely implement
> min/max aggregation pushdown to strings for new data going forward by
> checking the value of this flag. When the flag is not found on existing data,
> it could be assumed that the data could be truncated.
[GitHub] [parquet-format] mapleFU commented on a diff in pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
mapleFU commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333823691

##########
src/main/thrift/parquet.thrift:
##########
@@ -216,7 +216,12 @@ struct Statistics {
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
    /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * lower and upper bound values for the column, determined by its ColumnOrder.

Review Comment:
   Might not be directly related to this question: if the column is a utf-8 column (with LogicalType `UTF8`), can the binary here be non-utf8 bytes? E.g. `s[...100]` is valid utf-8, but `s[...50]` is not.
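mapleFU's concern is that cutting a UTF-8 byte sequence at an arbitrary byte offset can split a multi-byte code point, leaving an invalid UTF-8 bound. A writer that wants its truncated bounds to stay valid UTF-8 can back off to the previous character boundary. A sketch of that idea (an assumption for illustration, not parquet-format or parquet-mr code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8TruncateSketch {
  /** Truncates to at most maxBytes, stepping back over UTF-8 continuation bytes (10xxxxxx). */
  static byte[] truncateUtf8(byte[] utf8, int maxBytes) {
    if (utf8.length <= maxBytes) {
      return utf8;
    }
    int end = maxBytes;
    while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
      end--; // don't cut in the middle of a multi-byte code point
    }
    return Arrays.copyOf(utf8, end);
  }

  public static void main(String[] args) {
    byte[] s = "ab\u00e9".getBytes(StandardCharsets.UTF_8); // 4 bytes: 'a', 'b', 0xC3, 0xA9
    // A naive cut at 3 bytes would end with the lone lead byte 0xC3 (invalid UTF-8);
    // the boundary-aware cut keeps only "ab".
    byte[] cut = truncateUtf8(s, 3);
    System.out.println(new String(cut, StandardCharsets.UTF_8)); // prints "ab"
  }
}
```

Note that shortening bytes like this only works directly for a lower bound; a truncated upper bound must additionally be adjusted upward (e.g. by incrementing the last kept byte) so it still bounds the true maximum.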
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767772#comment-17767772 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333778393

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   I think they are relevant and can be discussed together. What do you think? @gszadovszky @shangxinli
[GitHub] [parquet-format] wgtmac commented on a diff in pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333778393

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   I think they are relevant and can be discussed together. What do you think? @gszadovszky @shangxinli
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767726#comment-17767726 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

findepi commented on PR #216:
URL: https://github.com/apache/parquet-format/pull/216#issuecomment-1730285274

   Thanks @raunaqmorarka, this lgtm
[GitHub] [parquet-format] findepi commented on pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
findepi commented on PR #216:
URL: https://github.com/apache/parquet-format/pull/216#issuecomment-1730285274

   Thanks @raunaqmorarka, this lgtm
Re: Parquet dictionary size limits?
Hmm, like a flag to basically turn off the isCompressionSatisfying check
per-column? That might be simplest!

So to summarize, a column will not use dictionary encoding when any of the
following holds:

(1) `parquet.enable.dictionary` is set to false
(2) the number of distinct values in a column chunk exceeds
    `DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to
    Integer.MAX_VALUE)
(3) the total encoded bytes in the dictionary exceed the value of
    `parquet.dictionary.page.size`
(4) the desired compression ratio (measured as # distinct values : total #
    values) is not achieved

I might try out various options for making (4) configurable, starting with
your suggestion, and test them out on more realistic data distributions.
Will try to return to this thread with my results in a few days :)

Best,
Claire

On Thu, Sep 21, 2023 at 8:48 AM Gang Wu wrote:

> The current implementation only checks the first page, which is
> vulnerable in many cases. I think your suggestion makes sense.
> However, there is no one-size-fits-all solution. How about simply
> adding a flag to enforce dictionary encoding for a specific column?
>
> On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty wrote:
>
> > I think I figured it out! The dictionaryByteSize == 0 was a red herring;
> > I was looking at an IntegerDictionaryValuesWriter for an empty column
> > rather than my high-cardinality column. Your analysis of the situation
> > was right--it was just that in the first page, there weren't enough
> > distinct values to pass the check.
> >
> > I wonder if we could maybe make this value configurable per-column?
> > Either:
> >
> > - A desired ratio of distinct values / total values, on a scale of 0-1.0
> > - Number of pages to check for compression before falling back
> >
> > Let me know what you think!
> >
> > Best,
> > Claire
> >
> > On Wed, Sep 20, 2023 at 9:37 AM Gang Wu wrote:
> >
> > > I don't understand why you get encodedSize == 1, dictionaryByteSize == 0
> > > and rawSize == 0 in the first page check. It seems that the page does
> > > not have any meaningful values. Could you please check how many values
> > > are written before the page check?
> > >
> > > On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty wrote:
> > >
> > > > Hey Gang,
> > > >
> > > > Thanks for the follow-up! I see what you're saying where it's
> > > > sometimes just bad luck with what ends up in the first page. The
> > > > intuition seems like a larger page size should produce a better
> > > > encoding in this case... I updated my branch
> > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > > > to add a test with a page size/dict page size of 10MB and am seeing
> > > > the same failure, though.
> > > >
> > > > Something seems kind of odd actually -- when I stepped through the
> > > > test I added w/ debugger, it falls back after invoking
> > > > isCompressionSatisfying with encodedSize == 1, dictionaryByteSize == 0
> > > > and rawSize == 0; 1 + 0 < 1 returns true. (You can also see this in
> > > > the System.out logs I added, in the branch's GHA run logs). This
> > > > doesn't seem right to me -- does isCompressionSatisfying need an
> > > > extra check to make sure the dictionary isn't empty?
> > > >
> > > > Also, thanks, Aaron! I got into this while running some
> > > > micro-benchmarks on Parquet reads when various dictionary/bloom
> > > > filter/encoding options are configured. Happy to share out when I'm
> > > > done.
> > > >
> > > > Best,
> > > > Claire
> > > >
> > > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu wrote:
> > > >
> > > > > Thanks for the investigation!
> > > > >
> > > > > I think the check below makes sense for a single page:
> > > > >
> > > > >   @Override
> > > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
> > > > >     return (encodedSize + dictionaryByteSize) < rawSize;
> > > > >   }
> > > > >
> > > > > The problem is that the fallback check is only performed on the
> > > > > first page. In the first page, all values in that page may be
> > > > > distinct, so it will be unlikely to pass the
> > > > > isCompressionSatisfying check.
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett wrote:
> > > > >
> > > > > > Claire, thank you for your research and examples on this topic,
> > > > > > I've learned a lot. My hunch is that your change would be a good
> > > > > > one, but I'm not an expert (and more to the point, not a
> > > > > > committer). I'm looking forward to learning more as this
> > > > > > discussion continues.
> > > > > >
> > > > > > Thank you again, Aaron
> > > > > >
> > > > > > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty wrote:
> > > > > >
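Claire's first proposal in the thread above, a configurable ratio of distinct to total values, could look roughly like this. This is a hypothetical sketch, not parquet-mr code; the names `maxDictRatio` and `shouldFallBack` are illustrative:

```java
// Hypothetical sketch of a configurable dictionary-fallback check based on
// the ratio of distinct values to total values, instead of the current
// (encodedSize + dictionaryByteSize) < rawSize check on the first page only.
public class DictFallbackSketch {
  /** Falls back to plain encoding when cardinality is too high a fraction of the data. */
  static boolean shouldFallBack(long distinctValues, long totalValues, double maxDictRatio) {
    if (totalValues == 0) {
      return false; // nothing written yet; don't decide on an empty page
    }
    return (double) distinctValues / totalValues > maxDictRatio;
  }

  public static void main(String[] args) {
    // 10,000 distinct out of 1,000,000 values with a 0.1 threshold: keep the dictionary.
    System.out.println(shouldFallBack(10_000, 1_000_000, 0.1)); // false
    // 900,000 distinct out of 1,000,000: fall back to plain encoding.
    System.out.println(shouldFallBack(900_000, 1_000_000, 0.1)); // true
  }
}
```

Unlike the first-page-only check, a ratio like this could be evaluated over all values written so far (or over a configurable number of pages, per the second proposal), so a high-cardinality first page would not permanently disable the dictionary.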
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767635#comment-17767635 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

raunaqmorarka commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r128796

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   Given the feature in https://issues.apache.org/jira/browse/PARQUET-1685, I want to assume that all existing stats are truncated. Going forward we should have a flag to explicitly indicate whether or not truncation took place, and applications should perform aggregation pushdown only if that flag is found to indicate no truncation. But I think adding that flag can be tackled separately as a follow-up.
[GitHub] [parquet-format] raunaqmorarka commented on a diff in pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
raunaqmorarka commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r128796

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   Given the feature in https://issues.apache.org/jira/browse/PARQUET-1685, I want to assume that all existing stats are truncated. Going forward we should have a flag to explicitly indicate whether or not truncation took place, and applications should perform aggregation pushdown only if that flag is found to indicate no truncation. But I think adding that flag can be tackled separately as a follow-up.
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767618#comment-17767618 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333286445

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   IIRC, you want to allow truncation by adding a flag to indicate whether the min_value/max_value is truncated or not, right?
[GitHub] [parquet-format] wgtmac commented on a diff in pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333286445

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   IIRC, you want to allow truncation by adding a flag to indicate whether the min_value/max_value is truncated or not, right?
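The flag discussed in these review threads could take roughly the following shape in the `Statistics` struct of parquet.thrift. This is a sketch for illustration only; the field names and ids below are assumptions, not the merged spec text:

```thrift
struct Statistics {
   ...
   /** lower and upper bound values for the column, determined by its ColumnOrder */
   5: optional binary max_value;
   6: optional binary min_value;
   /** If true, max_value is the exact maximum; if false it may be a truncated upper bound. */
   7: optional bool is_max_value_exact;
   /** If true, min_value is the exact minimum; if false it may be a truncated lower bound. */
   8: optional bool is_min_value_exact;
}
```

A writer would set the flag to false whenever it stores a truncated bound; readers would treat a missing flag on existing data as "possibly truncated", matching raunaqmorarka's assumption above.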
Re: Parquet dictionary size limits?
The current implementation only checks the first page, which is vulnerable
in many cases. I think your suggestion makes sense. However, there is no
one-size-fits-all solution. How about simply adding a flag to enforce
dictionary encoding for a specific column?

On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty wrote:

> I think I figured it out! The dictionaryByteSize == 0 was a red herring; I
> was looking at an IntegerDictionaryValuesWriter for an empty column rather
> than my high-cardinality column. Your analysis of the situation was
> right--it was just that in the first page, there weren't enough distinct
> values to pass the check.
>
> I wonder if we could maybe make this value configurable per-column? Either:
>
> - A desired ratio of distinct values / total values, on a scale of 0-1.0
> - Number of pages to check for compression before falling back
>
> Let me know what you think!
>
> Best,
> Claire
>
> On Wed, Sep 20, 2023 at 9:37 AM Gang Wu wrote:
>
> > I don't understand why you get encodedSize == 1, dictionaryByteSize == 0
> > and rawSize == 0 in the first page check. It seems that the page does not
> > have any meaningful values. Could you please check how many values are
> > written before the page check?
> >
> > On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty wrote:
> >
> > > Hey Gang,
> > >
> > > Thanks for the follow-up! I see what you're saying where it's sometimes
> > > just bad luck with what ends up in the first page. The intuition seems
> > > like a larger page size should produce a better encoding in this
> > > case... I updated my branch
> > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > > to add a test with a page size/dict page size of 10MB and am seeing the
> > > same failure, though.
> > >
> > > Something seems kind of odd actually -- when I stepped through the test
> > > I added w/ debugger, it falls back after invoking
> > > isCompressionSatisfying with encodedSize == 1, dictionaryByteSize == 0
> > > and rawSize == 0; 1 + 0 < 1 returns true. (You can also see this in the
> > > System.out logs I added, in the branch's GHA run logs). This doesn't
> > > seem right to me -- does isCompressionSatisfying need an extra check to
> > > make sure the dictionary isn't empty?
> > >
> > > Also, thanks, Aaron! I got into this while running some
> > > micro-benchmarks on Parquet reads when various dictionary/bloom
> > > filter/encoding options are configured. Happy to share out when I'm
> > > done.
> > >
> > > Best,
> > > Claire
> > >
> > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu wrote:
> > >
> > > > Thanks for the investigation!
> > > >
> > > > I think the check below makes sense for a single page:
> > > >
> > > >   @Override
> > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
> > > >     return (encodedSize + dictionaryByteSize) < rawSize;
> > > >   }
> > > >
> > > > The problem is that the fallback check is only performed on the first
> > > > page. In the first page, all values in that page may be distinct, so
> > > > it will be unlikely to pass the isCompressionSatisfying check.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett wrote:
> > > >
> > > > > Claire, thank you for your research and examples on this topic,
> > > > > I've learned a lot. My hunch is that your change would be a good
> > > > > one, but I'm not an expert (and more to the point, not a
> > > > > committer). I'm looking forward to learning more as this
> > > > > discussion continues.
> > > > >
> > > > > Thank you again, Aaron
> > > > >
> > > > > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty wrote:
> > > > >
> > > > > > I created a quick branch
> > > > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > > > > > to reproduce what I'm seeing -- the test shows that an Int column
> > > > > > with cardinality 100 successfully results in a dict encoding, but
> > > > > > an int column with cardinality 10,000 falls back and doesn't
> > > > > > create a dict encoding. This seems like a low threshold given the
> > > > > > 1MB dictionary page size, so I just wanted to check whether this
> > > > > > is expected or not :)
> > > > > >
> > > > > > Best,
> > > > > > Claire
> > > > > >
> > > > > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty wrote:
> > > > > >
> > > > > > > Hi, just wanted to follow up on this!
> > > > > > >
> > > > > > > I ran a debugger to find out why my column wasn't ending up
> > > > > > > with a dictionary encoding and it turns out that even though
> > > > > > >
[GitHub] [parquet-mr] Fokko merged pull request #1147: Bump jmh.version from 1.21 to 1.37
Fokko merged PR #1147: URL: https://github.com/apache/parquet-mr/pull/1147 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-2313) Bump actions/setup-java from 1 to 3
[ https://issues.apache.org/jira/browse/PARQUET-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767530#comment-17767530 ] ASF GitHub Bot commented on PARQUET-2313: - Fokko merged PR #203: URL: https://github.com/apache/parquet-format/pull/203 > Bump actions/setup-java from 1 to 3 > --- > > Key: PARQUET-2313 > URL: https://issues.apache.org/jira/browse/PARQUET-2313 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Gang Wu >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-format] Fokko merged pull request #203: PARQUET-2313: Bump actions/setup-java from 1 to 3
Fokko merged PR #203: URL: https://github.com/apache/parquet-format/pull/203
[jira] [Commented] (PARQUET-2268) Bump Thrift to 0.18.1
[ https://issues.apache.org/jira/browse/PARQUET-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767526#comment-17767526 ] ASF GitHub Bot commented on PARQUET-2268: - Fokko closed pull request #1054: PARQUET-2268: Bump Thrift to 0.18.1 URL: https://github.com/apache/parquet-mr/pull/1054 > Bump Thrift to 0.18.1 > - > > Key: PARQUET-2268 > URL: https://issues.apache.org/jira/browse/PARQUET-2268 > Project: Parquet > Issue Type: Improvement >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > >
[GitHub] [parquet-mr] Fokko closed pull request #1054: PARQUET-2268: Bump Thrift to 0.18.1
Fokko closed pull request #1054: PARQUET-2268: Bump Thrift to 0.18.1 URL: https://github.com/apache/parquet-mr/pull/1054
[GitHub] [parquet-mr] Fokko opened a new pull request, #1148: Remove PR limit
Fokko opened a new pull request, #1148: URL: https://github.com/apache/parquet-mr/pull/1148

Dependabot is effectively not working because 5 PRs have been open for ages.

Make sure you have checked _all_ steps below.

### Jira
- [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-XXX
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

### Commits
- [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explain what it does
[GitHub] [parquet-mr] dependabot[bot] commented on pull request #1110: Bump scalatest_2.12 from 3.0.1 to 3.3.0-SNAP4
dependabot[bot] commented on PR #1110: URL: https://github.com/apache/parquet-mr/pull/1110#issuecomment-1729435128 OK, I won't notify you about version 3.3.x again, unless you re-open this PR.
[GitHub] [parquet-mr] dependabot[bot] closed pull request #1110: Bump scalatest_2.12 from 3.0.1 to 3.3.0-SNAP4
dependabot[bot] closed pull request #1110: Bump scalatest_2.12 from 3.0.1 to 3.3.0-SNAP4 URL: https://github.com/apache/parquet-mr/pull/1110
[GitHub] [parquet-mr] Fokko commented on pull request #1110: Bump scalatest_2.12 from 3.0.1 to 3.3.0-SNAP4
Fokko commented on PR #1110: URL: https://github.com/apache/parquet-mr/pull/1110#issuecomment-1729435056 @dependabot ignore this minor version
[GitHub] [parquet-mr] Fokko commented on pull request #1068: Bump elephant-bird.version from 4.4 to 4.17
Fokko commented on PR #1068: URL: https://github.com/apache/parquet-mr/pull/1068#issuecomment-1729434254 @lukasnalezenec what do you think?
[GitHub] [parquet-mr] dependabot[bot] closed pull request #1062: Bump jmh.version from 1.21 to 1.36
dependabot[bot] closed pull request #1062: Bump jmh.version from 1.21 to 1.36 URL: https://github.com/apache/parquet-mr/pull/1062
[GitHub] [parquet-mr] dependabot[bot] commented on pull request #1062: Bump jmh.version from 1.21 to 1.36
dependabot[bot] commented on PR #1062: URL: https://github.com/apache/parquet-mr/pull/1062#issuecomment-1729427261 Superseded by #1147.
[GitHub] [parquet-mr] dependabot[bot] opened a new pull request, #1147: Bump jmh.version from 1.21 to 1.37
dependabot[bot] opened a new pull request, #1147: URL: https://github.com/apache/parquet-mr/pull/1147

Bumps `jmh.version` from 1.21 to 1.37.

Updates `org.openjdk.jmh:jmh-core` from 1.21 to 1.37. Commits:
- 2effa2c JMH v1.37.
- 09c78d5 7903508: JMH: Remove the Unicode dot prefix from secondary results
- 843f641 7903510: JMH: Add core performance checking tests
- 8bc325b 7903511: JMH: Add score stability performance tests
- 6b09724 7903450: JMH: Improve -prof perfnorm accuracy with robust estimations
- d88f901 7903504: JMH: Fix new Sonar warnings
- 47f651b 7903498: JMH: Reset worker interrupt status after iteration
- 482561a 7903492: JMH: Infrastructure code should yield occasionally for virtual execu...
- 9a97557 7903490: JMH: The interrupt to time-outing benchmark can be delivered to work...
- bf8db38 7903487: JMH: Make sure JMH profilers work on all tested configurations

Additional commits viewable in the compare view: https://github.com/openjdk/jmh/compare/1.21...1.37

Updates `org.openjdk.jmh:jmh-generator-annprocess` from 1.21 to 1.37 (same commits as listed above).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Dependabot commands and options: you can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it
[GitHub] [parquet-mr] Fokko commented on pull request #1062: Bump jmh.version from 1.21 to 1.36
Fokko commented on PR #1062: URL: https://github.com/apache/parquet-mr/pull/1062#issuecomment-1729426235 @dependabot rebase
[GitHub] [parquet-mr] Fokko opened a new pull request, #1146: Modest refactor of ParquetFileWriter
Fokko opened a new pull request, #1146: URL: https://github.com/apache/parquet-mr/pull/1146 IDEA was lighting up as a Christmas tree :)
[GitHub] [parquet-mr] Fokko opened a new pull request, #1145: Use try-with-resources
Fokko opened a new pull request, #1145: URL: https://github.com/apache/parquet-mr/pull/1145
[GitHub] [parquet-mr] Fokko opened a new pull request, #1144: Remove old setting for cascading
Fokko opened a new pull request, #1144: URL: https://github.com/apache/parquet-mr/pull/1144