[
https://issues.apache.org/jira/browse/IMPALA-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong resolved IMPALA-6678.
-----------------------------------
Resolution: Fixed
Fix Version/s: Impala 3.1.0
               Impala 2.13.0
> Better estimate of per-column compressed data size for low-NDV columns.
> -----------------------------------------------------------------------
>
> Key: IMPALA-6678
> URL: https://issues.apache.org/jira/browse/IMPALA-6678
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Affects Versions: Not Applicable
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
> Labels: resource-management
> Fix For: Impala 2.13.0, Impala 3.1.0
>
>
> In the previous IMPALA-4835 patch, we assumed that the "ideal" memory per
> Parquet column was 3 * 8MB, except when the total size of the file capped the
> total amount of memory we might use. This is often an overestimate,
> particularly for smaller files, files with large numbers of columns, and
> highly compressible data.
> We could do something smarter for Parquet given file sizes, per-partition row
> count, and column NDV. We can estimate the row count per file by dividing the
> per-partition row count across the partition's files, and estimate bytes per
> value with two methods:
> * For fixed-width types, estimating bytes per value based on the type width.
> We don't necessarily know the physical Parquet type, but it seems reasonable
> to estimate based on the type declared in the table.
> * log2(ndv) / 8, assuming that dictionary compression or general-purpose
> compression will kick in (see the sketch below).
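> To make the heuristic concrete, here is a minimal sketch of the bytes-per-value
> estimate. This is not the actual planner code; the function name, the use of
> min() to combine the two methods, and the clamping of NDV to at least 2 are
> assumptions for illustration only:
>
>   #include <algorithm>
>   #include <cmath>
>   #include <cstdint>
>
>   // Rough bytes-per-value estimate for a Parquet column, following the two
>   // methods above. 'declared_width' is the byte width of the type declared in
>   // the table (0 for variable-width types); 'ndv' comes from column stats.
>   double EstimateBytesPerValue(int declared_width, int64_t ndv) {
>     // Dictionary compression or general-purpose compression should need
>     // roughly log2(ndv) bits, i.e. log2(ndv) / 8 bytes, per value.
>     double compressed =
>         std::log2(static_cast<double>(std::max<int64_t>(ndv, 2))) / 8.0;
>     if (declared_width > 0) {
>       // For fixed-width types, never assume more than the declared width.
>       return std::min(static_cast<double>(declared_width), compressed);
>     }
>     return compressed;
>   }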
> See
> https://docs.google.com/document/d/1kR0zfevNNUJom3sH1XmposacVZ-QALan7NSwnR5CkSA/edit#heading=h.a2b8e8h5a6en
> for some analysis.
> I looked at encoded lineitem data and saw that many of the scanned columns
> were 3-4MB in size and that we could have estimated an ideal size < 24MB per
> column based on the above heuristics.
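> As a purely illustrative calculation (the NDV and row count below are made-up
> numbers, not measurements from lineitem): a column with an NDV of 256 encodes
> to about log2(256) / 8 = 1 byte per value, so a file with 5 million rows would
> need roughly 5MB for that column, well under the 3 * 8MB = 24MB default
> assumed by the IMPALA-4835 patch.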
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)