[ 
https://issues.apache.org/jira/browse/HIVE-29122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Kasa resolved HIVE-29122.
-----------------------------------
    Fix Version/s: 4.2.0
       Resolution: Fixed

Merged to master. Thanks [~soumyakanti.das] for the patch and [~dkuzmenko] for 
the review.

> Vectorization - Support IGNORE NULLS for FIRST_VALUE and LAST_VALUE
> -------------------------------------------------------------------
>
>                 Key: HIVE-29122
>                 URL: https://issues.apache.org/jira/browse/HIVE-29122
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 4.2.0
>            Reporter: Soumyakanti Das
>            Assignee: Soumyakanti Das
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> Currently we get wrong results for FIRST_VALUE and LAST_VALUE functions with 
> IGNORE NULLS when vectorization is enabled.
> Consider this table:
> {noformat}
> CREATE TABLE window_double_test (
>   id INT,
>   double_col DOUBLE
> );
> INSERT INTO window_double_test VALUES
>   (1, NULL),
>   (2, NULL),
>   (1, NULL),
>   (1, NULL),
>   (2, 25.5),
>   (3, NULL),
>   (6, 65.5),
>   (2, NULL),
>   (3, 32.5),
>   (4, NULL),
>   (3, 30.5),
>   (5, 50.5),
>   (4, 42.3),
>   (5, NULL),
>   (6, 65.2),
>   (7, NULL),
>   (NULL, 80.5),
>   (NULL, NULL);{noformat}
> Non-vectorized:
> {noformat}
> set hive.vectorized.execution.enabled=false;
> SELECT id, double_col,
>   FIRST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS first_double,
>   LAST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS last_double
> FROM window_double_test;
> id  double_col first_double last_double
> 1 NULL NULL NULL
> 1 NULL NULL NULL
> 1 NULL NULL NULL
> 2 NULL 25.5 25.5
> 2 25.5 25.5 25.5
> 2 NULL 25.5 25.5
> 3 NULL 25.5 32.5
> 3 30.5 25.5 32.5
> 3 32.5 25.5 32.5
> 4 NULL 25.5 42.3
> 4 42.3 25.5 42.3
> 5 50.5 25.5 50.5
> 5 NULL 25.5 50.5
> 6 65.2 25.5 65.5
> 6 65.5 25.5 65.5
> 7 NULL 25.5 65.5
> NULL  NULL 25.5 80.5
> NULL  80.5 25.5 80.5{noformat}
> Vectorized:
> {noformat}
> set hive.vectorized.execution.enabled=true;
> SELECT id, double_col,
>   FIRST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS first_double,
>   LAST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS last_double
> FROM window_double_test;
> id  double_col first_double last_double
> 1 NULL NULL NULL
> 1 NULL NULL NULL
> 1 NULL NULL NULL
> 2 NULL NULL NULL
> 2 25.5 NULL NULL
> 2 NULL NULL NULL
> 3 NULL NULL 32.5
> 3 30.5 NULL 32.5
> 3 32.5 NULL 32.5
> 4 NULL NULL 42.3
> 4 42.3 NULL 42.3
> 5 50.5 NULL NULL
> 5 NULL NULL NULL
> 6 65.2 NULL 65.5
> 6 65.5 NULL 65.5
> 7 NULL NULL NULL
> NULL  NULL NULL 80.5
> NULL  80.5 NULL 80.5{noformat}
> We can see the differences for id = 2, for example. Similar results can also 
> be seen for int and decimal columns too. We don't see the same for a String 
> column as only LONG, DECIMAL, and DOUBLE are supported, as can be seen in 
> Vectorizer#validatePTFOperator.
> There are other scenarios where it bypasses the Vectorizer, for example, if 
> we explicitly use ROWS instead of the default RANGE, etc. Anything that skips 
> vectorization produces correct results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to