[jira] [Commented] (PARQUET-2016) Reference column_order field from column indexes

ASF GitHub Bot (Jira) Thu, 08 Apr 2021 07:21:07 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317227#comment-17317227
 ]


ASF GitHub Bot commented on PARQUET-2016:
-----------------------------------------

gszadovszky commented on a change in pull request #173:
URL: https://github.com/apache/parquet-format/pull/173#discussion_r609753323



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -941,13 +941,14 @@ struct ColumnIndex {
   1: required list<bool> null_pages
 
   /**
-   * Two lists containing lower and upper bounds for the values of each page.
-   * These may be the actual minimum and maximum values found on a page, but
-   * can also be (more compact) values that do not exist on a page. For
-   * example, instead of storing ""Blart Versenwald III", a writer may set
-   * min_values[i]="B", max_values[i]="C". Such more compact values must still
-   * be valid values within the column's logical type. Readers must make sure
-   * that list entries are populated before using them by inspecting null_pages.
+   * Two lists containing lower and upper bounds for the values of each page
+   * determined by the ColumnOrder of the column. These may be the actual
+   * minimum and maximum values found on a page, but can also be (more compact)
+   * values that do not exist on a page. For example, instead of storing ""Blart
+   * Versenwald III", a writer may set min_values[i]="B", max_values[i]="C".

Review comment:
       There is a bit more information about the possible truncation in the
[column index spec](https://github.com/apache/parquet-format/blob/master/PageIndex.md)
(search for "truncate"). The only existing type that would allow such
truncation is BINARY<STRING>, but I guess the spec did not want to be too
restrictive, to leave room for potential later types.
   Anyway, parquet-mr has implemented a truncation mechanism for UTF8 strings,
and the default length above which we truncate is 64.
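
To illustrate the kind of truncation discussed here, below is a minimal Python sketch. It is not the parquet-mr implementation: the function names `truncate_min` and `truncate_max` are hypothetical, and the sketch counts Unicode code points for simplicity, which is an assumption about how the length is measured. It relies on the fact that comparing Python strings by code point gives the same order as unsigned byte-wise comparison of their UTF-8 encodings.

```python
from typing import Optional

# Default taken from the review comment; the unit (code points) is an assumption.
DEFAULT_TRUNCATE_LENGTH = 64
MAX_CODE_POINT = 0x10FFFF


def truncate_min(value: str, length: int = DEFAULT_TRUNCATE_LENGTH) -> str:
    """Lower bound: a prefix always sorts at or before the full string."""
    return value[:length]


def truncate_max(value: str, length: int = DEFAULT_TRUNCATE_LENGTH) -> Optional[str]:
    """Upper bound: cut to `length`, then increment the last code point.

    Returns None when no shorter upper bound exists (every kept code point
    is already the maximum); a caller would then keep the full value.
    """
    if len(value) <= length:
        return value  # nothing to truncate
    prefix = list(value[:length])
    while prefix:
        code = ord(prefix[-1]) + 1
        if 0xD800 <= code <= 0xDFFF:
            code = 0xE000  # skip surrogates, which are not valid in UTF-8
        if code <= MAX_CODE_POINT:
            prefix[-1] = chr(code)
            return "".join(prefix)
        prefix.pop()  # last code point cannot be incremented, so shorten further
    return None


value = "Blart Versenwald III"
print(truncate_min(value, 1), truncate_max(value, 1))  # prints: B C
```

With a length of 1 this reproduces the example from the thrift comment: `min_values[i]="B"` and `max_values[i]="C"` both remain valid strings and still bound the original value.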






> Reference column_order field from column indexes
> ------------------------------------------------
>
>                 Key: PARQUET-2016
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2016
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>
> We have created the field column_order to specify the ordering of a primitive 
> type. It is used for the row group level statistics, but we never referenced 
> it from the column indexes feature, even though both rely heavily on 
> the ordering.
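
As a rough illustration of why the ordering matters for min/max bounds, the Python sketch below computes page bounds for the same binary values under two different orderings. The helper names (`unsigned_key`, `signed_key`, `page_bounds`) are hypothetical. The unsigned byte-wise comparison is what parquet-format's TypeDefinedOrder specifies for byte arrays; the signed variant is included only as a contrasting assumption, not as something a reader should implement.

```python
from typing import Callable, List, Tuple


def unsigned_key(value: bytes) -> Tuple[int, ...]:
    """Compare bytes as unsigned integers (0..255)."""
    return tuple(value)


def signed_key(value: bytes) -> Tuple[int, ...]:
    """Compare bytes as signed integers (-128..127), for contrast only."""
    return tuple(b - 256 if b >= 128 else b for b in value)


def page_bounds(values: List[bytes], key: Callable[[bytes], Tuple[int, ...]]) -> Tuple[bytes, bytes]:
    """Return (min, max) of one page's values under the given ordering."""
    return min(values, key=key), max(values, key=key)


# Hypothetical page content: UTF-8 encoded strings, one with a non-ASCII character.
page = [b"a", "é".encode("utf-8"), b"z"]

print(page_bounds(page, unsigned_key))  # prints: (b'a', b'\xc3\xa9')
print(page_bounds(page, signed_key))    # prints: (b'\xc3\xa9', b'z')
```

The two orderings disagree on both bounds, so a reader that filters pages using min_values and max_values has to know which ordering the writer applied.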



