[ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:
--------------------------------
    Attachment: HIVE-22579.01.branch-2.patch

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22579
>                 URL: https://issues.apache.org/jira/browse/HIVE-22579
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: HIVE-22579.01.branch-2.patch, 
> HIVE-22579.01.branch-2.patch
>
>
> There is a scenario in which different SplitGenerator instances each try to 
> cover the delta-only buckets (those having no base file), so multiple OrcSplit 
> instances can be generated for the same delta file. As a result, more than one 
> task reads the same delta file, which produces duplicate records even in a 
> simple select star query.
> File structure for a 256-bucket table:
> {code}
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/base_0000013
> -rw-r--r--   3 hive hadoop        353 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00012
> -rw-r--r--   3 hive hadoop       1642 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1635 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:04 
> /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:04 
> /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1808 2019-11-29 16:04 
> /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:06 
> /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:06 
> /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00043
> -rwxrwxrwx   3 hive hadoop       1633 2019-11-29 16:06 
> /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
> {code}
> In this case, when the bucket_00171 file has a record and there is no base file 
> for that bucket, a select * with the ETL split strategy can generate 2 splits 
> for the same delta bucket.
> The scenario of the issue:
> 1. ETLSplitStrategy contains a [covered[] 
> array|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L763]
>  which is [shared between the SplitInfo 
> instances|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L824]
>  to be created
> 2. a SplitInfo instance is created for [every base file (2 in this 
> case)|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L809]
> 3. for every SplitInfo, [a SplitGenerator is 
> created|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L925-L926],
>  and in the constructor, [parent's getSplit is 
> called|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1251],
>  which tries to take care of the deltas
> I'm not sure at the moment what the intention of this is, but this way 
> duplicate delta splits can be generated, which can cause duplicate reads 
> later (note that both tasks read the same delta file: bucket_00171):
> {code}
> 2019-12-01T16:24:53,669  INFO [TezTR-127843_16_30_0_171_0 
> (1575040127843_0016_30_00_000171_0)] orc.ReaderImpl: Reading ORC rows from 
> hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
>  with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true], offset: 0, length: 9223372036854775807, schema: 
> struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:53,672  INFO [TezTR-127843_16_30_0_171_0 
> (1575040127843_0016_30_00_000171_0)] lib.MRReaderMapred: Processing split: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit 
> [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1,
>  start=171, length=0, isOriginal=false, fileLength=9223372036854775807, 
> hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: 
> [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 
> 16 stmtIds: [0] }]]
> 2019-12-01T16:24:55,807  INFO [TezTR-127843_16_30_0_425_0 
> (1575040127843_0016_30_00_000425_0)] orc.ReaderImpl: Reading ORC rows from 
> hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
>  with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true], offset: 0, length: 9223372036854775807, schema: 
> struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:55,813  INFO [TezTR-127843_16_30_0_425_0 
> (1575040127843_0016_30_00_000425_0)] lib.MRReaderMapred: Processing split: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit 
> [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1,
>  start=171, length=0, isOriginal=false, fileLength=9223372036854775807, 
> hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: 
> [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 
> 16 stmtIds: [0] }]]
> {code}
> It seems this issue doesn't affect AcidV2, as getSplits() returns an empty 
> collection or throws an exception in case of unexpected deltas (which was the 
> case here, where the deltas were not unexpected):
> https://github.com/apache/hive/blob/8ee3497f87f81fa84ee1023e891dc54087c2cd5e/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1178-L1197
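A minimal sketch of the scenario in steps 1-3 of the description, as a toy model rather than Hive's actual code. The class and method names (CoveredDeltaSketch, deltaOnlySplits, splitsForBucket) are hypothetical. The sketch assumes that the buckets with base files (12 and 140 in the listing) are already flagged in the shared covered[] array before any generator runs, and that each generator emits a delta-only split for every bucket that is not flagged. The markCovered switch illustrates the idea in the issue title; it is not the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the scenario; none of these names are Hive APIs.
class CoveredDeltaSketch {
    static final int NUM_BUCKETS = 256;
    // Buckets that have a base file in the directory listing (assumed pre-covered).
    static final int[] BASE_BUCKETS = {12, 140};

    /** One generator pass: returns the bucket ids it emits delta-only splits for. */
    static List<Integer> deltaOnlySplits(boolean[] covered, boolean markCovered) {
        List<Integer> splits = new ArrayList<>();
        for (int b = 0; b < NUM_BUCKETS; b++) {
            if (!covered[b]) {
                splits.add(b);
                if (markCovered) {
                    covered[b] = true; // later generators will now skip this bucket
                }
            }
        }
        return splits;
    }

    /** Runs one generator per base file over a shared covered[] array and
     *  counts how many delta-only splits were emitted for the given bucket. */
    static int splitsForBucket(int bucket, boolean markCovered) {
        boolean[] covered = new boolean[NUM_BUCKETS];
        for (int b : BASE_BUCKETS) {
            covered[b] = true; // a base file covers its own bucket
        }
        int count = 0;
        for (int i = 0; i < BASE_BUCKETS.length; i++) {
            for (int b : deltaOnlySplits(covered, markCovered)) {
                if (b == bucket) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("bucket 171, not marked: " + splitsForBucket(171, false));
        System.out.println("bucket 171, marked: " + splitsForBucket(171, true));
    }
}
```

In this model, two generators sharing an array that is never updated both emit a split for bucket 171, which mirrors the two tasks in the log reading the same delta file. Flagging the bucket when its delta-only split is emitted leaves a single split.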



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
