[
https://issues.apache.org/jira/browse/IMPALA-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong resolved IMPALA-6678.
-----------------------------------
Resolution: Fixed
Fix Version/s: Impala 3.1.0
               Impala 2.13.0
> Better estimate of per-column compressed data size for low-NDV columns.
> -----------------------------------------------------------------------
>
> Key: IMPALA-6678
> URL: https://issues.apache.org/jira/browse/IMPALA-6678
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Affects Versions: Not Applicable
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
> Labels: resource-management
> Fix For: Impala 2.13.0, Impala 3.1.0
>
>
> In the previous IMPALA-4835 patch, we assumed that the "ideal" memory per
> Parquet column was 3 * 8MB, except when the total size of the file capped the
> total amount of memory we might use. This is often an overestimate,
> particularly for smaller files, files with large numbers of columns, and
> highly compressible data.
> We could do something smarter for Parquet given file sizes, per-partition row
> count, and column NDV. We can estimate the row count per file by dividing the
> per-partition row count across the partition's files, and estimate bytes per
> value with two methods:
> * For fixed-width types, estimating bytes per value based on the type width.
> We don't necessarily know the physical Parquet type, but it seems reasonable
> to estimate based on the type declared in the table.
> * log2(ndv) / 8, assuming that dictionary compression or general-purpose
> compression will kick in (see the sketch below).
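> To make the heuristic concrete, here is a minimal sketch of the bytes-per-value
> estimate. This is not the actual planner code; the function name, the use of
> min() to combine the two methods, and the clamping of NDV to at least 2 are
> assumptions for illustration only:
>
>   #include <algorithm>
>   #include <cmath>
>   #include <cstdint>
>
>   // Rough bytes-per-value estimate for a Parquet column, following the two
>   // methods above. 'declared_width' is the byte width of the type declared in
>   // the table (0 for variable-width types); 'ndv' comes from column stats.
>   double EstimateBytesPerValue(int declared_width, int64_t ndv) {
>     // Dictionary compression or general-purpose compression should need
>     // roughly log2(ndv) bits, i.e. log2(ndv) / 8 bytes, per value.
>     double compressed =
>         std::log2(static_cast<double>(std::max<int64_t>(ndv, 2))) / 8.0;
>     if (declared_width > 0) {
>       // For fixed-width types, never assume more than the declared width.
>       return std::min(static_cast<double>(declared_width), compressed);
>     }
>     return compressed;
>   }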
> See
> https://docs.google.com/document/d/1kR0zfevNNUJom3sH1XmposacVZ-QALan7NSwnR5CkSA/edit#heading=h.a2b8e8h5a6en
> for some analysis.
> I looked at encoded lineitem data and saw that many of the scanned columns
> were 3-4MB in size and that we could have estimated an ideal size < 24MB per
> column based on the above heuristics.
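> As a purely illustrative calculation (the NDV and row count below are made-up
> numbers, not measurements from lineitem): a column with an NDV of 256 encodes
> to about log2(256) / 8 = 1 byte per value, so a file with 5 million rows would
> need roughly 5MB for that column, well under the 3 * 8MB = 24MB default
> assumed by the IMPALA-4835 patch.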
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)