Csaba Ringhofer created IMPALA-8449:
---------------------------------------

             Summary: Avoid Parquet pages with too many rows + try to make them 
aligned
                 Key: IMPALA-8449
                 URL: https://issues.apache.org/jira/browse/IMPALA-8449
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Csaba Ringhofer


Currently Impala has a limit for Parquet data page size, but not for the number 
of rows in a page. This means that if a page can be encoded efficiently with 
RLE, then any number of rows can fit into it. This is an issue for 
column indexes, because ordered columns (which are very good candidates for 
min/max filtering) with low enough NDV to fit into the dictionary will be 
encoded "too well", making the per-page index too coarse-grained.

Parquet-mr chose the approach of adding a configurable "max row count in page" 
(20000 by default): PARQUET-1414. This would work for Impala too, and is 
relatively simple to implement, but I think that it is still a sub-optimal 
solution for column indexes, as it doesn't make every page aligned: some 
pages may hit the max size limit first and end up with fewer than 20000 rows, 
which makes all subsequent pages in the column chunk non-aligned. The max size 
limit seems important for string columns, as long strings could lead to very 
large pages otherwise.

An alternative algorithm is to start a new page at every Nth row of the column 
chunk, regardless of the number of rows in the current page. This would result 
in the same layout as the previous approach for columns where pages always 
hit the max row count limit before the max size limit, but for other columns, 
alignment would be reestablished after every Nth row.
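
To make the difference between the two approaches described above concrete, 
here is a minimal Python sketch. It is only an illustration, not Impala or 
parquet-mr code: the function names, the per-row byte sizes, and the tiny 
limits (4 rows, 10 bytes) are all made up, and it assumes that "aligned" means 
page boundaries falling on the same row indexes in different columns.

```python
def page_starts_max_rows(row_sizes, max_rows, max_bytes):
    """Hypothetical model of a 'max row count in page' limit: a new page
    starts when the current page already holds max_rows rows, or when
    adding the next row would exceed max_bytes."""
    starts, rows_in_page, bytes_in_page = [], 0, 0
    for i, size in enumerate(row_sizes):
        if i == 0 or rows_in_page >= max_rows or bytes_in_page + size > max_bytes:
            starts.append(i)
            rows_in_page, bytes_in_page = 0, 0
        rows_in_page += 1
        bytes_in_page += size
    return starts


def page_starts_every_nth(row_sizes, max_rows, max_bytes):
    """Hypothetical model of the alternative: a new page starts at every
    row index that is a multiple of max_rows, and additionally whenever
    adding the next row would exceed max_bytes."""
    starts, bytes_in_page = [], 0
    for i, size in enumerate(row_sizes):
        if i % max_rows == 0 or bytes_in_page + size > max_bytes:
            starts.append(i)
            bytes_in_page = 0
        bytes_in_page += size
    return starts


# Made-up encoded sizes in bytes per row for two columns of 12 rows.
small_col = [1] * 12                  # never hits the size limit
string_col = [4, 4, 4] + [1] * 9      # hits the size limit at row 2

for name, fn in [("max rows", page_starts_max_rows),
                 ("every Nth", page_starts_every_nth)]:
    a = fn(small_col, max_rows=4, max_bytes=10)
    b = fn(string_col, max_rows=4, max_bytes=10)
    print(name, a, b, "shared boundaries:", sorted(set(a) & set(b)))
```

In this toy input the string column splits early at row 2 under both rules. 
With the row count limit, its later pages start at rows 6 and 10 and never 
line up with the small column again. With the every-Nth rule, it starts pages 
at rows 4 and 8 like the small column does.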




