Chiran Ravani created HIVE-24245:
------------------------------------

             Summary: Vectorized PTF with count and distinct over partition 
producing incorrect results.
                 Key: HIVE-24245
                 URL: https://issues.apache.org/jira/browse/HIVE-24245
             Project: Hive
          Issue Type: Bug
          Components: Hive, PTF-Windowing, Vectorization
    Affects Versions: 3.1.2, 3.1.0
            Reporter: Chiran Ravani


Vectorized PTF for count and distinct over partition is broken. It produces 
incorrect results.
Below is the test case.

{code}
CREATE TABLE bigd781b_new (
  id int,
  txt1 string,
  txt2 string,
  cda_date int,
  cda_job_name varchar(12));

INSERT INTO bigd781b_new VALUES 
  (1,'2010005759','7164335675012038',20200528,'load1'),
  (2,'2010005759','7164335675012038',20200528,'load2');
{code}

Running below query produces incorrect results

{code}
SELECT
    txt1,
    txt2,
    count(distinct txt1) over(partition by txt1) as n,
    count(distinct txt2) over(partition by txt2) as m
FROM bigd781b_new
WHERE cda_date = 20200528 and ( txt2 = '7164335675012038');
{code}

as below.

{code}
+-------------+-------------------+----+----+
|    txt1     |       txt2        | n  | m  |
+-------------+-------------------+----+----+
| 2010005759  | 7164335675012038  | 2  | 2  |
| 2010005759  | 7164335675012038  | 2  | 2  |
+-------------+-------------------+----+----+
{code}

While the correct output would be

{code}
+-------------+-------------------+----+----+
|    txt1     |       txt2        | n  | m  |
+-------------+-------------------+----+----+
| 2010005759  | 7164335675012038  | 1  | 1  |
| 2010005759  | 7164335675012038  | 1  | 1  |
+-------------+-------------------+----+----+
{code}


The problem does not appear after setting below property
set hive.vectorized.execution.ptf.enabled=false;




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to