[
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002983#comment-15002983
]
Sergey Shelukhin commented on HIVE-11583:
-----------------------------------------
You could generate it in the test by repeatedly cross joining. Or does the file
have to be in a specific form that is not reproducible by the queries?
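A minimal sketch of the cross-join approach, for illustration only. The names ptf_seed and ptf_big_gen and the padding length are hypothetical, and it assumes a Hive version with INSERT ... VALUES (0.14 or later). Whether it reproduces the failure depends on the row size needed to push the second block past 32 MB, which the description does not state.

{code:sql}
-- Cross joins are rejected in strict mode, so relax it for the test.
SET hive.mapred.mode=nonstrict;

-- Hypothetical seed table holding the digits 0..9.
CREATE TABLE ptf_seed (n INT);
INSERT INTO TABLE ptf_seed VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- Five-way cross join yields ids 0..99999; keep 0..50000 (50001 rows).
-- All rows share one key so they land in a single window partition.
CREATE TABLE ptf_big_gen AS
SELECT
  t.id,
  'k1' AS key,
  CASE WHEN t.id < 25000 THEN 'A'    -- 25000 rows
       WHEN t.id < 45000 THEN 'B'    -- 20000 rows
       ELSE 'C' END AS grp,          -- 5001 rows
  repeat('x', 1000) AS value         -- assumed padding; tune to reach the block size
FROM (
  SELECT a.n + 10 * b.n + 100 * c.n + 1000 * d.n + 10000 * e.n AS id
  FROM ptf_seed a
  CROSS JOIN ptf_seed b
  CROSS JOIN ptf_seed c
  CROSS JOIN ptf_seed d
  CROSS JOIN ptf_seed e
) t
WHERE t.id <= 50000;
{code}

The grp boundaries are chosen to match the per-group counts quoted in the description below (25000, 20000, 5001).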
> When PTF is used over a large partitions result could be corrupted
> ------------------------------------------------------------------
>
> Key: HIVE-11583
> URL: https://issues.apache.org/jira/browse/HIVE-11583
> Project: Hive
> Issue Type: Bug
> Components: PTF-Windowing
> Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
> Environment: Hadoop 2.6 + Apache hive built from trunk
> Reporter: Illya Yalovyy
> Assignee: Illya Yalovyy
> Priority: Critical
> Fix For: 1.3.0, 2.0.0
>
> Attachments: HIVE-11583.patch
>
>
> Dataset:
> Window has 50001 records (2 blocks on disk and 1 block in memory)
> Size of the second block is >32 MB (2 splits)
> Result:
> When the last block is read from disk, only the first split is actually
> loaded; the second split is missed. The total row count of the result
> dataset is correct, but some records are missing and others are duplicated.
> Example:
> {code:sql}
> CREATE TABLE ptf_big_src (
> id INT,
> key STRING,
> grp STRING,
> value STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO
> TABLE ptf_big_src;
> SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt desc;
> ---
> -- A 25000
> -- B 20000
> -- C 5001
> ---
> CREATE TABLE ptf_big_trg AS SELECT *, row_number() OVER (PARTITION BY key
> ORDER BY grp) grp_num FROM ptf_big_src;
> SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt desc;
> ---
> -- A 34296
> -- B 15704
> -- C 1
> ---
> {code}
> Counts by 'grp' are incorrect!
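>
> The missing and duplicated records can also be listed directly. This is a
> sketch only and assumes 'id' is unique in ptf_big_src, which the description
> does not state:
> {code:sql}
> -- ids that appear more than once in the PTF output
> SELECT id, COUNT(1) AS copies
> FROM ptf_big_trg
> GROUP BY id
> HAVING COUNT(1) > 1;
> -- ids present in the source but absent from the PTF output
> SELECT s.id
> FROM ptf_big_src s
> LEFT OUTER JOIN ptf_big_trg t ON s.id = t.id
> WHERE t.id IS NULL;
> {code}
> Both queries should return no rows once the second split is read correctly.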