[ 
https://issues.apache.org/jira/browse/IMPALA-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-8431:
------------------------------------
    Description: 
https://github.com/apache/impala/blob/5fa076e95cfbfcc044dc14cbb20af825936af82a/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L1698

computeMinScalarColumnMemReservation() uses the avg_size stat to estimate the 
memory needed per value during scanning, but avg_size does not include the 
4-byte per-value length field used in plain encoding, which can dominate in 
columns with very short strings. (Compression can probably negate this effect.)
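
To illustrate how much the length field can matter, here is a minimal sketch 
(not the Impala code). It relies on the Parquet PLAIN encoding storing each 
BYTE_ARRAY value as a 4-byte length followed by the bytes. The class name, 
method name and sample numbers are hypothetical.

```java
public class PlainEncodedStringEstimate {
    // PLAIN-encoded BYTE_ARRAY values carry a 4-byte length before the data.
    static final int LENGTH_PREFIX_BYTES = 4;

    /** Estimated uncompressed bytes for numValues plain-encoded strings. */
    static long plainEncodedBytes(double avgSize, long numValues) {
        return (long) Math.ceil((avgSize + LENGTH_PREFIX_BYTES) * numValues);
    }

    public static void main(String[] args) {
        double avgSize = 2.0;        // hypothetical avg_size stat (short strings)
        long numValues = 1_000_000L; // hypothetical number of values

        long withoutPrefix = (long) Math.ceil(avgSize * numValues);
        long withPrefix = plainEncodedBytes(avgSize, numValues);

        System.out.println("avg_size only:     " + withoutPrefix + " bytes");
        System.out.println("with length field: " + withPrefix + " bytes");
    }
}
```

With these hypothetical numbers the length fields account for two thirds of 
the plain-encoded size, so an estimate based on avg_size alone covers only one 
third of the uncompressed data.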

For the dictionary decoding estimate:
- these 4 bytes/NDV should also be added, as the dictionary itself is also 
plain encoded
- the backend uses an additional 12 bytes/NDV for the StringValues used as 
indirection in the dictionary, but I am not sure whether this should be added 
to the reservation
- a more pessimistic estimate would use max_size instead of avg_size for 
dictionary entries, as it is possible that the majority of distinct values are 
long while the short ones are much more frequent, which makes avg_size small

Another small underestimation is that NULL values are ignored. NULLs (= def 
levels) could be added as 1 bit/value.
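
The dictionary and NULL adjustments suggested above could be sketched roughly 
as follows. This is only an illustration of the arithmetic, not the logic in 
HdfsScanNode.java: the class, the method names and the sample statistics are 
hypothetical, the 12 bytes per entry is the figure quoted in this issue, and 
whether the indirection belongs in the reservation is left as a flag because 
the issue leaves that open.

```java
public class DictReservationSketch {
    static final int LENGTH_PREFIX_BYTES = 4; // length field per plain-encoded entry
    static final int STRING_VALUE_BYTES = 12; // per-entry indirection, as quoted in the issue

    /**
     * Estimated bytes for a dictionary with ndv entries. entrySize is either
     * avg_size or, for a more pessimistic estimate, max_size.
     */
    static long dictionaryBytes(long ndv, double entrySize, boolean includeIndirection) {
        double perEntry = entrySize + LENGTH_PREFIX_BYTES
                + (includeIndirection ? STRING_VALUE_BYTES : 0);
        return (long) Math.ceil(perEntry * ndv);
    }

    /** Definition levels counted as 1 bit per value, rounded up to whole bytes. */
    static long defLevelBytes(long numValues) {
        return (numValues + 7) / 8;
    }

    public static void main(String[] args) {
        long ndv = 1_000L;           // hypothetical NDV stat
        double avgSize = 10.0;       // hypothetical avg_size stat
        double maxSize = 100.0;      // hypothetical max_size stat
        long numValues = 1_000_000L; // hypothetical number of values

        System.out.println("avg_size only:          " + (long) Math.ceil(avgSize * ndv));
        System.out.println("avg_size + length:      " + dictionaryBytes(ndv, avgSize, false));
        System.out.println("avg_size + length + SV: " + dictionaryBytes(ndv, avgSize, true));
        System.out.println("max_size + length:      " + dictionaryBytes(ndv, maxSize, false));
        System.out.println("def levels:             " + defLevelBytes(numValues));
    }
}
```

Passing max_size as entrySize gives the pessimistic variant, which covers the 
case where most distinct values are long but the frequent ones are short.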

  was:
https://github.com/apache/impala/blob/5fa076e95cfbfcc044dc14cbb20af825936af82a/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L1698

computeMinScalarColumnMemReservation() uses the avg_size stat to estimate the 
memory needed per value during scanning, but avg_size does not include the 
4-byte per-value length field used in plain encoding, which can dominate in 
columns with very short strings. (Compression can probably negate this effect.)

For the dictionary decoding estimate:
- these 4 bytes/NDV should also be added, as the dictionary itself is also 
plain encoded
- an additional 12 bytes/NDV are used for the StringValues used as indirection 
in the dictionary, but I am not sure whether this should be added to the 
reservation
- a more pessimistic estimate would use max_size instead of avg_size for 
dictionary entries, as it is possible that the majority of distinct values are 
long while the short ones are much more frequent, which makes avg_size small

Another small underestimation is that NULL values are ignored. NULLs (= def 
levels) could be added as 1 bit/value.


> Parquet STRING column memory reservation seems underestimated
> -------------------------------------------------------------
>
>                 Key: IMPALA-8431
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8431
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.2.0
>            Reporter: Csaba Ringhofer
>            Priority: Minor
>              Labels: parquet, reservation
>


