[
https://issues.apache.org/jira/browse/HIVE-29122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krisztian Kasa resolved HIVE-29122.
-----------------------------------
Fix Version/s: 4.2.0
Resolution: Fixed
Merged to master. Thanks [~soumyakanti.das] for the patch and [~dkuzmenko] for
the review.
> Vectorization - Support IGNORE NULLS for FIRST_VALUE and LAST_VALUE
> -------------------------------------------------------------------
>
> Key: HIVE-29122
> URL: https://issues.apache.org/jira/browse/HIVE-29122
> Project: Hive
> Issue Type: Bug
> Components: Vectorization
> Affects Versions: 4.2.0
> Reporter: Soumyakanti Das
> Assignee: Soumyakanti Das
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.2.0
>
>
> Currently we get wrong results for FIRST_VALUE and LAST_VALUE functions with
> IGNORE NULLS when vectorization is enabled.
> Consider this table:
> {noformat}
> CREATE TABLE window_double_test (
> id INT,
> double_col DOUBLE
> );
> INSERT INTO window_double_test VALUES
> (1, NULL),
> (2, NULL),
> (1, NULL),
> (1, NULL),
> (2, 25.5),
> (3, NULL),
> (6, 65.5),
> (2, NULL),
> (3, 32.5),
> (4, NULL),
> (3, 30.5),
> (5, 50.5),
> (4, 42.3),
> (5, NULL),
> (6, 65.2),
> (7, NULL),
> (NULL, 80.5),
> (NULL, NULL);{noformat}
> Non-vectorized:
> {noformat}
> set hive.vectorized.execution.enabled=false;
> SELECT id, double_col,
> FIRST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS first_double,
> LAST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS last_double
> FROM window_double_test;
> id double_col first_double last_double
> 1 NULL NULL NULL
> 1 NULL NULL NULL
> 1 NULL NULL NULL
> 2 NULL 25.5 25.5
> 2 25.5 25.5 25.5
> 2 NULL 25.5 25.5
> 3 NULL 25.5 32.5
> 3 30.5 25.5 32.5
> 3 32.5 25.5 32.5
> 4 NULL 25.5 42.3
> 4 42.3 25.5 42.3
> 5 50.5 25.5 50.5
> 5 NULL 25.5 50.5
> 6 65.2 25.5 65.5
> 6 65.5 25.5 65.5
> 7 NULL 25.5 65.5
> NULL NULL 25.5 80.5
> NULL 80.5 25.5 80.5{noformat}
> Vectorized:
> {noformat}
> set hive.vectorized.execution.enabled=true;
> SELECT id, double_col,
> FIRST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS first_double,
> LAST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS last_double
> FROM window_double_test;
> id double_col first_double last_double
> 1 NULL NULL NULL
> 1 NULL NULL NULL
> 1 NULL NULL NULL
> 2 NULL NULL NULL
> 2 25.5 NULL NULL
> 2 NULL NULL NULL
> 3 NULL NULL 32.5
> 3 30.5 NULL 32.5
> 3 32.5 NULL 32.5
> 4 NULL NULL 42.3
> 4 42.3 NULL 42.3
> 5 50.5 NULL NULL
> 5 NULL NULL NULL
> 6 65.2 NULL 65.5
> 6 65.5 NULL 65.5
> 7 NULL NULL NULL
> NULL NULL NULL 80.5
> NULL 80.5 NULL 80.5{noformat}
> We can see the differences for id = 2, for example. Similar results can also
> be seen for int and decimal columns too. We don't see the same for a String
> column as only LONG, DECIMAL, and DOUBLE are supported, as can be seen in
> Vectorizer#validatePTFOperator.
> There are other scenarios where it bypasses the Vectorizer, for example, if
> we explicitly use ROWS instead of the default RANGE, etc. Anything that skips
> vectorization produces correct results.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)