[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

Sergey Shelukhin (JIRA) Thu, 12 Nov 2015 13:38:21 -0800

    [ https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002983#comment-15002983 ]


Sergey Shelukhin commented on HIVE-11583:
-----------------------------------------

You could generate the data in the test itself by repeatedly cross joining. Or 
does the file have to be in a specific form that cannot be reproduced by queries? 
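
For what it's worth, a rough sketch of the cross-join approach is below. 
Everything in it is an assumption rather than something taken from the 
attached data file: the helper table ptf_digits is a made-up name, all rows 
share a single made-up key 'k1', the value width of 2000 characters is a guess 
that may need tuning to push a block past 32 MB, and INSERT ... VALUES needs 
Hive 0.14 or later. It only aims to match the per-grp row counts listed in the 
description (25000 / 20000 / 5001) and assumes ptf_big_src already exists as 
defined there.

```sql
-- Cartesian products are rejected in strict mode, so relax it for the test.
SET hive.mapred.mode=nonstrict;

-- Hypothetical helper table holding the ten digits 0..9.
CREATE TABLE ptf_digits (d INT);
INSERT INTO TABLE ptf_digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- A five-way cross join yields 10^5 = 100000 distinct ids (1..100000);
-- keep the first 50001 and split them into groups A, B and C.
INSERT OVERWRITE TABLE ptf_big_src
SELECT t.id,
       'k1' AS key,                       -- assumed: one key, so one large partition
       CASE WHEN t.id <= 25000 THEN 'A'   -- ids 1..25000      -> 25000 rows
            WHEN t.id <= 45000 THEN 'B'   -- ids 25001..45000  -> 20000 rows
            ELSE 'C'                      -- ids 45001..50001  -> 5001 rows
       END AS grp,
       repeat('x', 2000) AS value         -- assumed padding width, tune as needed
FROM (
  SELECT a.d + 10 * b.d + 100 * c.d + 1000 * e.d + 10000 * f.d + 1 AS id
  FROM ptf_digits a
  CROSS JOIN ptf_digits b
  CROSS JOIN ptf_digits c
  CROSS JOIN ptf_digits e
  CROSS JOIN ptf_digits f
) t
WHERE t.id <= 50001;
```

Whether a single key and uniform padding are enough to hit the same block and 
split layout as the attached file is something I have not checked.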

> When PTF is used over a large partitions result could be corrupted
> ------------------------------------------------------------------
>
>                 Key: HIVE-11583
>                 URL: https://issues.apache.org/jira/browse/HIVE-11583
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing
>    Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
>         Environment: Hadoop 2.6 + Apache hive built from trunk
>            Reporter: Illya Yalovyy
>            Assignee: Illya Yalovyy
>            Priority: Critical
>             Fix For: 1.3.0, 2.0.0
>
>         Attachments: HIVE-11583.patch
>
>
> Dataset: 
>  Window has 50001 records (2 blocks on disk and 1 block in memory)
>  Size of the second block is >32 MB (2 splits)
> Result:
> When the last block is read from disk, only the first split is actually 
> loaded. The second split is missed. The total count of the result dataset 
> is correct, but some records are missing and others are duplicated.
> Example:
> {code:sql}
> CREATE TABLE ptf_big_src (
>   id INT,
>   key STRING,
>   grp STRING,
>   value STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO 
> TABLE ptf_big_src;
> SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt desc;
> ---
> -- A  25000
> -- B  20000
> -- C  5001
> ---
> CREATE TABLE ptf_big_trg AS SELECT *, row_number() OVER (PARTITION BY key 
> ORDER BY grp) grp_num FROM ptf_big_src;
> SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt desc;
> ---
> -- A  34296
> -- B  15704
> -- C  1
> ---
> {code}
> Counts by 'grp' are incorrect!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
