[jira] [Commented] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

Hive QA (Jira) Mon, 09 Dec 2019 02:57:19 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991486#comment-16991486
 ]


Hive QA commented on HIVE-22579:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12988299/HIVE-22579.01.branch-2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 13 failed/errored test(s), 10742 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_queries]
 (batchId=227)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[avro_tableproperty_optimize]
 (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[explaindenpendencydiffengs]
 (batchId=38)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[table_nonprintable]
 (batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[join_acid_non_acid]
 (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[union_fast_stats]
 (batchId=153)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_parquet_types]
 (batchId=155)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[merge_negative_5]
 (batchId=88)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=100)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[explaindenpendencydiffengs]
 (batchId=115)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_input_format_excludes]
 (batchId=117)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] 
(batchId=125)
org.apache.hadoop.hive.ql.security.TestStorageBasedMetastoreAuthorizationReads.testReadTableFailure
 (batchId=217)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/19829/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/19829/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-19829/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 13 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12988299 - PreCommit-HIVE-Build

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22579
>                 URL: https://issues.apache.org/jira/browse/HIVE-22579
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: HIVE-22579.01.branch-2.patch, 
> HIVE-22579.01.branch-2.patch
>
>
> There is a scenario when different SplitGenerator instances try to cover the 
> delta-only buckets (having no base file) more than once, so there could be 
> multiple OrcSplit instances generated for the same delta file, causing more 
> tasks to read the same delta file more than once, causing duplicate records 
> in a simple select star query.
> File structure for a 256 bucket table
> {code}
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/base_0000013
> -rw-r--r--   3 hive hadoop        353 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00012
> -rw-r--r--   3 hive hadoop       1642 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1635 2019-11-29 15:55 
> /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:04 
> /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:04 
> /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1808 2019-11-29 16:04 
> /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:06 
> /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:06 
> /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00043
> -rwxrwxrwx   3 hive hadoop       1633 2019-11-29 16:06 
> /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
> {code}
> in this case, when bucket_00171 file has a record, and there is no base file 
> for that, a select (*) with ETL split strategy can generate 2 splits for the 
> same delta bucket...
> the scenario of the issue:
> 1. ETLSplitStrategy contains a [covered[] 
> array|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L763]
>  which is [shared between the SplitInfo 
> instances|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L824]
>  to be created
> 2. a SplitInfo instance is created for [every base file (2 in this 
> case)|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L809]
> 3. for every SplitInfo, [a SplitGenerator is 
> created|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L925-L926],
>  and in the constructor, [parent's getSplit is 
> called|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1251],
>  which tries to take care of the deltas
> I'm not sure at the moment what's the intention of this, but this way, 
> duplicated delta split can be generated, which can cause duplicated read 
> later (note that both tasks read the same delta file: bucket_00171)
> {code}
> 2019-12-01T16:24:53,669  INFO [TezTR-127843_16_30_0_171_0 
> (1575040127843_0016_30_00_000171_0)] orc.ReaderImpl: Reading ORC rows from 
> hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
>  with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true], offset: 0, length: 9223372036854775807, schema: 
> struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:53,672  INFO [TezTR-127843_16_30_0_171_0 
> (1575040127843_0016_30_00_000171_0)] lib.MRReaderMapred: Processing split: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit 
> [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1,
>  start=171, length=0, isOriginal=false, fileLength=9223372036854775807, 
> hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: 
> [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 
> 16 stmtIds: [0] }]]
> 2019-12-01T16:24:55,807  INFO [TezTR-127843_16_30_0_425_0 
> (1575040127843_0016_30_00_000425_0)] orc.ReaderImpl: Reading ORC rows from 
> hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
>  with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true], offset: 0, length: 9223372036854775807, schema: 
> struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:55,813  INFO [TezTR-127843_16_30_0_425_0 
> (1575040127843_0016_30_00_000425_0)] lib.MRReaderMapred: Processing split: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit 
> [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1,
>  start=171, length=0, isOriginal=false, fileLength=9223372036854775807, 
> hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: 
> [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 
> 16 stmtIds: [0] }]]
> {code}
> seems like this issue doesn't affect AcidV2, as getSplits() returns an empty 
> collection or throws an exception in case of unexpected deltas (which was the 
> case here, where deltas was not unexpected):
> https://github.com/apache/hive/blob/8ee3497f87f81fa84ee1023e891dc54087c2cd5e/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1178-L1197



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

Reply via email to