[jira] [Commented] (PARQUET-1381) Add merge blocks command to parquet-tools
[ https://issues.apache.org/jira/browse/PARQUET-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767786#comment-17767786 ]

ASF GitHub Bot commented on PARQUET-1381:
-----------------------------------------

shangxinli commented on PR #1121:
URL: https://github.com/apache/parquet-mr/pull/1121#issuecomment-1730734276

   This is a great initiative. Do you still have a plan to address the feedback @MaheshGPai ?

> Add merge blocks command to parquet-tools
> -----------------------------------------
>
>                 Key: PARQUET-1381
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1381
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Ekaterina Galieva
>            Assignee: Ekaterina Galieva
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation of the merge command in parquet-tools doesn't merge row
> groups; it just places one after the other. Add an API and a command option to
> merge small blocks into larger ones up to a specified size limit.
> h6. Implementation details:
> Blocks are not reordered, so as not to break possible initial predicate pushdown
> optimizations.
> Blocks are not divided to fit the upper bound perfectly.
> This is an intentional performance optimization.
> This gives an opportunity to form new blocks by copying the full content of
> smaller blocks by column, not by row.
> h6. Examples:
> # Input files with block sizes:
> {code:java}
> [128 | 35], [128 | 40], [120]{code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b -l 256}}
> {code:java}
> [163 | 168 | 120]
> {code}
> # Input files with block sizes:
> {code:java}
> [128 | 35], [40], [120], [6]{code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 40 | 120 | 6]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 75 | 126]
> {code}
> {{merge -b -l 256}}
> {code:java}
> [203 | 126]{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [parquet-mr] shangxinli commented on pull request #1121: PARQUET-1381: Support merging of rowgroups during file rewrite
shangxinli commented on PR #1121:
URL: https://github.com/apache/parquet-mr/pull/1121#issuecomment-1730734276

   This is a great initiative. Do you still have a plan to address the feedback @MaheshGPai ?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
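The expected outputs in the issue description above follow a simple greedy rule: scan the blocks in their original order (never reordering or splitting them) and close a merged block whenever adding the next input block would exceed the size limit. A rough sketch of that rule, not the actual parquet-mr implementation (class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockMergeSketch {
  /** Greedily groups block sizes, in order, into merged blocks of at most `limit`. */
  static List<Long> mergeSizes(List<Long> blocks, long limit) {
    List<Long> merged = new ArrayList<>();
    long current = 0;
    for (long b : blocks) {
      if (current > 0 && current + b > limit) {
        merged.add(current); // close the current group before it would overflow
        current = 0;
      }
      current += b; // a single oversized block still forms its own group
    }
    if (current > 0) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    // Example 1 from the issue, merge -b -l 256 on [128 | 35], [128 | 40], [120]:
    System.out.println(mergeSizes(List.of(128L, 35L, 128L, 40L, 120L), 256)); // [163, 168, 120]
    // Example 2, merge -b -l 256 on [128 | 35], [40], [120], [6]:
    System.out.println(mergeSizes(List.of(128L, 35L, 40L, 120L, 6L), 256));   // [203, 126]
  }
}
```

With the default 128 limit the same rule reproduces the `merge -b` outputs as well, e.g. [128 | 35 | 128 | 40 | 120] for the first example.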
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767784#comment-17767784 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

mapleFU commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333823691

##########
src/main/thrift/parquet.thrift:
##########
@@ -216,7 +216,12 @@ struct Statistics {
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
    /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * lower and upper bound values for the column, determined by its ColumnOrder.

Review Comment:
   Might not be directly related to this question: if the column is a utf-8 column (with LogicalType `UTF8`), can the binary here be non-utf8 bytes? E.g. `s[...100]` is valid utf-8, but `s[...50]` is not.

> Update parquet format spec to allow truncation of row group min/max stats
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-2352
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2352
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Raunaq Morarka
>            Priority: Major
>
> Column index stats are explicitly allowed to be truncated
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L958]
> However, it seems row group min/max stats are not allowed to be truncated
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L219]
> although this is not explicitly clarified as in the column index case. This
> forces implementations to either drop min/max row group stats for columns
> with long strings and miss opportunities for filtering row groups, or
> seemingly deviate from spec by truncating min/max row group stats.
> https://issues.apache.org/jira/browse/PARQUET-1685 introduced a feature to
> parquet-mr which allows users to deviate from spec and configure truncation
> of min/max row group stats. Unfortunately, there is no way for readers to
> detect whether truncation took place.
> Since the possibility of truncation exists and is not possible to explicitly
> detect, attempts to push down min/max aggregation to parquet have avoided
> implementing it for string columns (e.g.
> https://issues.apache.org/jira/browse/SPARK-36645)
> Given the above situation, the spec should be updated to allow truncation of
> min/max row group stats. This would align the spec with the current reality
> that string column min/max row group stats could be truncated.
> Additionally, a flag could be added to the stats to specify whether min/max
> stats are truncated. Reader implementations could then safely implement
> min/max aggregation pushdown to strings for new data going forward by
> checking the value of this flag. When the flag is not found on existing data,
> it could be assumed that the data could be truncated.
[GitHub] [parquet-format] mapleFU commented on a diff in pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
mapleFU commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333823691

##########
src/main/thrift/parquet.thrift:
##########
@@ -216,7 +216,12 @@ struct Statistics {
    /** count of distinct values occurring */
    4: optional i64 distinct_count;
    /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * lower and upper bound values for the column, determined by its ColumnOrder.

Review Comment:
   Might not be directly related to this question: if the column is a utf-8 column (with LogicalType `UTF8`), can the binary here be non-utf8 bytes? E.g. `s[...100]` is valid utf-8, but `s[...50]` is not.
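mapleFU's concern is that cutting a UTF-8 byte sequence at an arbitrary byte offset can split a multi-byte code point, leaving an invalid UTF-8 bound. A writer that wants its truncated bounds to stay valid UTF-8 can back off to the previous character boundary. A sketch of that idea (an assumption for illustration, not parquet-format or parquet-mr code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8TruncateSketch {
  /** Truncates to at most maxBytes, stepping back over UTF-8 continuation bytes (10xxxxxx). */
  static byte[] truncateUtf8(byte[] utf8, int maxBytes) {
    if (utf8.length <= maxBytes) {
      return utf8;
    }
    int end = maxBytes;
    while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
      end--; // don't cut in the middle of a multi-byte code point
    }
    return Arrays.copyOf(utf8, end);
  }

  public static void main(String[] args) {
    byte[] s = "ab\u00e9".getBytes(StandardCharsets.UTF_8); // 4 bytes: 'a', 'b', 0xC3, 0xA9
    // A naive cut at 3 bytes would end with the lone lead byte 0xC3 (invalid UTF-8);
    // the boundary-aware cut keeps only "ab".
    byte[] cut = truncateUtf8(s, 3);
    System.out.println(new String(cut, StandardCharsets.UTF_8)); // prints "ab"
  }
}
```

Note that shortening bytes like this only works directly for a lower bound; a truncated upper bound must additionally be adjusted upward (e.g. by incrementing the last kept byte) so it still bounds the true maximum.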
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767772#comment-17767772 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333778393

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   I think they are relevant and can be discussed together. What do you think? @gszadovszky @shangxinli
[GitHub] [parquet-format] wgtmac commented on a diff in pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333778393

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   I think they are relevant and can be discussed together. What do you think? @gszadovszky @shangxinli
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767726#comment-17767726 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

findepi commented on PR #216:
URL: https://github.com/apache/parquet-format/pull/216#issuecomment-1730285274

   Thanks @raunaqmorarka, this lgtm
[GitHub] [parquet-format] findepi commented on pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
findepi commented on PR #216:
URL: https://github.com/apache/parquet-format/pull/216#issuecomment-1730285274

   Thanks @raunaqmorarka, this lgtm
Re: Parquet dictionary size limits?
Hmm, like a flag to basically turn off the isCompressionSatisfying check
per-column? That might be simplest!

So to summarize, a column will not use dictionary encoding when any of the
following holds:

(1) `parquet.enable.dictionary` is set to false
(2) the number of distinct values in a column chunk exceeds
    `DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to
    Integer.MAX_VALUE)
(3) the total encoded bytes in the dictionary exceed the value of
    `parquet.dictionary.page.size`
(4) the desired compression ratio (measured as # distinct values : total #
    values) is not achieved

I might try out various options for making (4) configurable, starting with
your suggestion, and test them out on more realistic data distributions.
Will try to return to this thread with my results in a few days :)

Best,
Claire

On Thu, Sep 21, 2023 at 8:48 AM Gang Wu wrote:

> The current implementation only checks the first page, which is
> vulnerable in many cases. I think your suggestion makes sense.
> However, there is no one-size-fits-all solution. How about simply
> adding a flag to enforce dictionary encoding for a specific column?
>
> On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty wrote:
>
> > I think I figured it out! The dictionaryByteSize == 0 was a red herring;
> > I was looking at an IntegerDictionaryValuesWriter for an empty column
> > rather than my high-cardinality column. Your analysis of the situation
> > was right--it was just that in the first page, there weren't enough
> > distinct values to pass the check.
> >
> > I wonder if we could maybe make this value configurable per-column?
> > Either:
> >
> > - A desired ratio of distinct values / total values, on a scale of 0-1.0
> > - Number of pages to check for compression before falling back
> >
> > Let me know what you think!
> >
> > Best,
> > Claire
> >
> > On Wed, Sep 20, 2023 at 9:37 AM Gang Wu wrote:
> >
> > > I don't understand why you get encodedSize == 1, dictionaryByteSize == 0
> > > and rawSize == 0 in the first page check. It seems that the page does
> > > not have any meaningful values. Could you please check how many values
> > > are written before the page check?
> > >
> > > On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty wrote:
> > >
> > > > Hey Gang,
> > > >
> > > > Thanks for the follow-up! I see what you're saying where it's
> > > > sometimes just bad luck with what ends up in the first page. The
> > > > intuition seems like a larger page size should produce a better
> > > > encoding in this case... I updated my branch
> > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > > > to add a test with a page size/dict page size of 10MB and am seeing
> > > > the same failure, though.
> > > >
> > > > Something seems kind of odd actually -- when I stepped through the
> > > > test I added w/ debugger, it falls back after invoking
> > > > isCompressionSatisfying with encodedSize == 1, dictionaryByteSize == 0
> > > > and rawSize == 0; 1 + 0 < 1 returns true. (You can also see this in
> > > > the System.out logs I added, in the branch's GHA run logs). This
> > > > doesn't seem right to me -- does isCompressionSatisfying need an
> > > > extra check to make sure the dictionary isn't empty?
> > > >
> > > > Also, thanks, Aaron! I got into this while running some
> > > > micro-benchmarks on Parquet reads when various dictionary/bloom
> > > > filter/encoding options are configured. Happy to share out when I'm
> > > > done.
> > > >
> > > > Best,
> > > > Claire
> > > >
> > > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu wrote:
> > > >
> > > > > Thanks for the investigation!
> > > > >
> > > > > I think the check below makes sense for a single page:
> > > > >
> > > > >   @Override
> > > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
> > > > >     return (encodedSize + dictionaryByteSize) < rawSize;
> > > > >   }
> > > > >
> > > > > The problem is that the fallback check is only performed on the
> > > > > first page. In the first page, all values in that page may be
> > > > > distinct, so it will be unlikely to pass the
> > > > > isCompressionSatisfying check.
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett wrote:
> > > > >
> > > > > > Claire, thank you for your research and examples on this topic,
> > > > > > I've learned a lot. My hunch is that your change would be a good
> > > > > > one, but I'm not an expert (and more to the point, not a
> > > > > > committer). I'm looking forward to learning more as this
> > > > > > discussion continues.
> > > > > >
> > > > > > Thank you again, Aaron
> > > > > >
> > > > > > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty wrote:
> > > > > >
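Claire's first proposal in the thread above, a configurable ratio of distinct to total values, could look roughly like this. This is a hypothetical sketch, not parquet-mr code; the names `maxDictRatio` and `shouldFallBack` are illustrative:

```java
// Hypothetical sketch of a configurable dictionary-fallback check based on
// the ratio of distinct values to total values, instead of the current
// (encodedSize + dictionaryByteSize) < rawSize check on the first page only.
public class DictFallbackSketch {
  /** Falls back to plain encoding when cardinality is too high a fraction of the data. */
  static boolean shouldFallBack(long distinctValues, long totalValues, double maxDictRatio) {
    if (totalValues == 0) {
      return false; // nothing written yet; don't decide on an empty page
    }
    return (double) distinctValues / totalValues > maxDictRatio;
  }

  public static void main(String[] args) {
    // 10,000 distinct out of 1,000,000 values with a 0.1 threshold: keep the dictionary.
    System.out.println(shouldFallBack(10_000, 1_000_000, 0.1)); // false
    // 900,000 distinct out of 1,000,000: fall back to plain encoding.
    System.out.println(shouldFallBack(900_000, 1_000_000, 0.1)); // true
  }
}
```

Unlike the first-page-only check, a ratio like this could be evaluated over all values written so far (or over a configurable number of pages, per the second proposal), so a high-cardinality first page would not permanently disable the dictionary.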
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767635#comment-17767635 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

raunaqmorarka commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r128796

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   Given the feature in https://issues.apache.org/jira/browse/PARQUET-1685, I want to assume that all existing stats are truncated. Going forward we should have a flag to explicitly indicate whether or not truncation took place, and applications should perform aggregation pushdown only if that flag is found to indicate no truncation. But I think adding that flag can be tackled separately as a follow-up.
[GitHub] [parquet-format] raunaqmorarka commented on a diff in pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
raunaqmorarka commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r128796

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   Given the feature in https://issues.apache.org/jira/browse/PARQUET-1685, I want to assume that all existing stats are truncated. Going forward we should have a flag to explicitly indicate whether or not truncation took place, and applications should perform aggregation pushdown only if that flag is found to indicate no truncation. But I think adding that flag can be tackled separately as a follow-up.
[jira] [Commented] (PARQUET-2352) Update parquet format spec to allow truncation of row group min/max stats
[ https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767618#comment-17767618 ]

ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------

wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333286445

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   IIRC, you want to allow truncation by adding a flag to indicate whether the min_value/max_value is truncated or not, right?
[GitHub] [parquet-format] wgtmac commented on a diff in pull request #216: PARQUET-2352: Allow truncation of row group min_values/max_value statistics
wgtmac commented on code in PR #216:
URL: https://github.com/apache/parquet-format/pull/216#discussion_r1333286445

(on src/main/thrift/parquet.thrift, same hunk as above)

Review Comment:
   IIRC, you want to allow truncation by adding a flag to indicate whether the min_value/max_value is truncated or not, right?
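The flag discussed in these review threads could take roughly the following shape in the `Statistics` struct of parquet.thrift. This is a sketch for illustration only; the field names and ids below are assumptions, not the merged spec text:

```thrift
struct Statistics {
   ...
   /** lower and upper bound values for the column, determined by its ColumnOrder */
   5: optional binary max_value;
   6: optional binary min_value;
   /** If true, max_value is the exact maximum; if false it may be a truncated upper bound. */
   7: optional bool is_max_value_exact;
   /** If true, min_value is the exact minimum; if false it may be a truncated lower bound. */
   8: optional bool is_min_value_exact;
}
```

A writer would set the flag to false whenever it stores a truncated bound; readers would treat a missing flag on existing data as "possibly truncated", matching raunaqmorarka's assumption above.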
Re: Parquet dictionary size limits?
The current implementation only checks the first page, which is vulnerable
in many cases. I think your suggestion makes sense. However, there is no
one-size-fits-all solution. How about simply adding a flag to enforce
dictionary encoding for a specific column?

On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty wrote:

> I think I figured it out! The dictionaryByteSize == 0 was a red herring; I
> was looking at an IntegerDictionaryValuesWriter for an empty column rather
> than my high-cardinality column. Your analysis of the situation was
> right--it was just that in the first page, there weren't enough distinct
> values to pass the check.
>
> I wonder if we could maybe make this value configurable per-column? Either:
>
> - A desired ratio of distinct values / total values, on a scale of 0-1.0
> - Number of pages to check for compression before falling back
>
> Let me know what you think!
>
> Best,
> Claire
>
> On Wed, Sep 20, 2023 at 9:37 AM Gang Wu wrote:
>
> > I don't understand why you get encodedSize == 1, dictionaryByteSize == 0
> > and rawSize == 0 in the first page check. It seems that the page does not
> > have any meaningful values. Could you please check how many values are
> > written before the page check?
> >
> > On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty wrote:
> >
> > > Hey Gang,
> > >
> > > Thanks for the follow-up! I see what you're saying where it's sometimes
> > > just bad luck with what ends up in the first page. The intuition seems
> > > like a larger page size should produce a better encoding in this
> > > case... I updated my branch
> > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > > to add a test with a page size/dict page size of 10MB and am seeing the
> > > same failure, though.
> > >
> > > Something seems kind of odd actually -- when I stepped through the test
> > > I added w/ debugger, it falls back after invoking
> > > isCompressionSatisfying with encodedSize == 1, dictionaryByteSize == 0
> > > and rawSize == 0; 1 + 0 < 1 returns true. (You can also see this in the
> > > System.out logs I added, in the branch's GHA run logs). This doesn't
> > > seem right to me -- does isCompressionSatisfying need an extra check to
> > > make sure the dictionary isn't empty?
> > >
> > > Also, thanks, Aaron! I got into this while running some
> > > micro-benchmarks on Parquet reads when various dictionary/bloom
> > > filter/encoding options are configured. Happy to share out when I'm
> > > done.
> > >
> > > Best,
> > > Claire
> > >
> > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu wrote:
> > >
> > > > Thanks for the investigation!
> > > >
> > > > I think the check below makes sense for a single page:
> > > >
> > > >   @Override
> > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
> > > >     return (encodedSize + dictionaryByteSize) < rawSize;
> > > >   }
> > > >
> > > > The problem is that the fallback check is only performed on the first
> > > > page. In the first page, all values in that page may be distinct, so
> > > > it will be unlikely to pass the isCompressionSatisfying check.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett wrote:
> > > >
> > > > > Claire, thank you for your research and examples on this topic,
> > > > > I've learned a lot. My hunch is that your change would be a good
> > > > > one, but I'm not an expert (and more to the point, not a
> > > > > committer). I'm looking forward to learning more as this
> > > > > discussion continues.
> > > > >
> > > > > Thank you again, Aaron
> > > > >
> > > > > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty wrote:
> > > > >
> > > > > > I created a quick branch
> > > > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > > > > > to reproduce what I'm seeing -- the test shows that an Int column
> > > > > > with cardinality 100 successfully results in a dict encoding, but
> > > > > > an int column with cardinality 10,000 falls back and doesn't
> > > > > > create a dict encoding. This seems like a low threshold given the
> > > > > > 1MB dictionary page size, so I just wanted to check whether this
> > > > > > is expected or not :)
> > > > > >
> > > > > > Best,
> > > > > > Claire
> > > > > >
> > > > > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty wrote:
> > > > > >
> > > > > > > Hi, just wanted to follow up on this!
> > > > > > >
> > > > > > > I ran a debugger to find out why my column wasn't ending up
> > > > > > > with a dictionary encoding and it turns out that even though
> > > > > > >
[GitHub] [parquet-mr] Fokko merged pull request #1147: Bump jmh.version from 1.21 to 1.37
Fokko merged PR #1147: URL: https://github.com/apache/parquet-mr/pull/1147 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-2313) Bump actions/setup-java from 1 to 3
[ https://issues.apache.org/jira/browse/PARQUET-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767530#comment-17767530 ] ASF GitHub Bot commented on PARQUET-2313: - Fokko merged PR #203: URL: https://github.com/apache/parquet-format/pull/203 > Bump actions/setup-java from 1 to 3 > --- > > Key: PARQUET-2313 > URL: https://issues.apache.org/jira/browse/PARQUET-2313 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Gang Wu >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-format] Fokko merged pull request #203: PARQUET-2313: Bump actions/setup-java from 1 to 3
Fokko merged PR #203: URL: https://github.com/apache/parquet-format/pull/203
[jira] [Commented] (PARQUET-2268) Bump Thrift to 0.18.1
[ https://issues.apache.org/jira/browse/PARQUET-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767526#comment-17767526 ] ASF GitHub Bot commented on PARQUET-2268: - Fokko closed pull request #1054: PARQUET-2268: Bump Thrift to 0.18.1 URL: https://github.com/apache/parquet-mr/pull/1054 > Bump Thrift to 0.18.1 > - > > Key: PARQUET-2268 > URL: https://issues.apache.org/jira/browse/PARQUET-2268 > Project: Parquet > Issue Type: Improvement >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > >
[GitHub] [parquet-mr] Fokko closed pull request #1054: PARQUET-2268: Bump Thrift to 0.18.1
Fokko closed pull request #1054: PARQUET-2268: Bump Thrift to 0.18.1 URL: https://github.com/apache/parquet-mr/pull/1054
[GitHub] [parquet-mr] Fokko opened a new pull request, #1148: Remove PR limit
Fokko opened a new pull request, #1148: URL: https://github.com/apache/parquet-mr/pull/1148

Dependabot is effectively not working because 5 PRs have been open for ages.

Make sure you have checked _all_ steps below.

### Jira
- [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-XXX
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

### Commits
- [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explain what it does
[GitHub] [parquet-mr] dependabot[bot] commented on pull request #1110: Bump scalatest_2.12 from 3.0.1 to 3.3.0-SNAP4
dependabot[bot] commented on PR #1110: URL: https://github.com/apache/parquet-mr/pull/1110#issuecomment-1729435128 OK, I won't notify you about version 3.3.x again, unless you re-open this PR.
[GitHub] [parquet-mr] dependabot[bot] closed pull request #1110: Bump scalatest_2.12 from 3.0.1 to 3.3.0-SNAP4
dependabot[bot] closed pull request #1110: Bump scalatest_2.12 from 3.0.1 to 3.3.0-SNAP4 URL: https://github.com/apache/parquet-mr/pull/1110
[GitHub] [parquet-mr] Fokko commented on pull request #1110: Bump scalatest_2.12 from 3.0.1 to 3.3.0-SNAP4
Fokko commented on PR #1110: URL: https://github.com/apache/parquet-mr/pull/1110#issuecomment-1729435056 @dependabot ignore this minor version
[GitHub] [parquet-mr] Fokko commented on pull request #1068: Bump elephant-bird.version from 4.4 to 4.17
Fokko commented on PR #1068: URL: https://github.com/apache/parquet-mr/pull/1068#issuecomment-1729434254 @lukasnalezenec what do you think?
[GitHub] [parquet-mr] dependabot[bot] closed pull request #1062: Bump jmh.version from 1.21 to 1.36
dependabot[bot] closed pull request #1062: Bump jmh.version from 1.21 to 1.36 URL: https://github.com/apache/parquet-mr/pull/1062
[GitHub] [parquet-mr] dependabot[bot] commented on pull request #1062: Bump jmh.version from 1.21 to 1.36
dependabot[bot] commented on PR #1062: URL: https://github.com/apache/parquet-mr/pull/1062#issuecomment-1729427261 Superseded by #1147.
[GitHub] [parquet-mr] dependabot[bot] opened a new pull request, #1147: Bump jmh.version from 1.21 to 1.37
dependabot[bot] opened a new pull request, #1147: URL: https://github.com/apache/parquet-mr/pull/1147

Bumps `jmh.version` from 1.21 to 1.37.

Updates `org.openjdk.jmh:jmh-core` from 1.21 to 1.37. Commits:
- 2effa2c JMH v1.37.
- 09c78d5 7903508: JMH: Remove the Unicode dot prefix from secondary results
- 843f641 7903510: JMH: Add core performance checking tests
- 8bc325b 7903511: JMH: Add score stability performance tests
- 6b09724 7903450: JMH: Improve -prof perfnorm accuracy with robust estimations
- d88f901 7903504: JMH: Fix new Sonar warnings
- 47f651b 7903498: JMH: Reset worker interrupt status after iteration
- 482561a 7903492: JMH: Infrastructure code should yield occasionally for virtual execu...
- 9a97557 7903490: JMH: The interrupt to time-outing benchmark can be delivered to work...
- bf8db38 7903487: JMH: Make sure JMH profilers work on all tested configurations

Additional commits viewable in the compare view: https://github.com/openjdk/jmh/compare/1.21...1.37

Updates `org.openjdk.jmh:jmh-generator-annprocess` from 1.21 to 1.37 (same commits as listed above).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Dependabot commands and options: you can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it
[GitHub] [parquet-mr] Fokko commented on pull request #1062: Bump jmh.version from 1.21 to 1.36
Fokko commented on PR #1062: URL: https://github.com/apache/parquet-mr/pull/1062#issuecomment-1729426235 @dependabot rebase
[GitHub] [parquet-mr] Fokko opened a new pull request, #1146: Modest refactor of ParquetFileWriter
Fokko opened a new pull request, #1146: URL: https://github.com/apache/parquet-mr/pull/1146 IDEA was lighting up as a Christmas tree :)
[GitHub] [parquet-mr] Fokko opened a new pull request, #1145: Use try-with-resources
Fokko opened a new pull request, #1145: URL: https://github.com/apache/parquet-mr/pull/1145
[GitHub] [parquet-mr] Fokko opened a new pull request, #1144: Remove old setting for cascading
Fokko opened a new pull request, #1144: URL: https://github.com/apache/parquet-mr/pull/1144