[
https://issues.apache.org/jira/browse/HIVE-26584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612407#comment-17612407
]
John Sherman edited comment on HIVE-26584 at 10/3/22 6:19 PM:
--------------------------------------------------------------
After digging in deeper: you are correct, it is not a concurrency issue.
Concurrency just happened to be the easiest way to repro, and I mistakenly
thought it was the root of the issue (before we had the containerized ptest
framework, test conflicts were somewhat common, iirc).
Here is what I think is happening:
1. During PR testing TestMiniLlapLocalCliDriver tests get split into 32
different splits
[https://github.com/apache/hive/blob/master/itests/bin/generate-cli-splits.sh]
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestMiniLlapLocalCliDriver.java#L39]
(It codegens 32 new TestMiniLlapLocalCliDriver classes, each with split0 -
split32 in the package name)
2. Test assignment for each split is handled via runtime introspection of the
class name:
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestMiniLlapLocalCliDriver.java#L43]
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/SplitSupport.java#L46]
In my PR's case:
empty_skip_header_footer_aggr.q gets assigned to split-7:
{code:java}
<testcase name="testCliDriver[empty_skip_header_footer_aggr]"
classname="org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver"
time="2.534"/>
{code}
compressed_skip_header_footer_aggr.q gets assigned to split-4:
{code:java}
<testcase name="testCliDriver[compressed_skip_header_footer_aggr]"
classname="org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver"
time="7.242">
{code}
3. All test splits are distributed across 20 executors (not sure where this
lives, maybe the Jenkins scripts).
split-7 and split-4 get assigned to the same "execution split", 14:
{code:java}
split-14/itests/qtest/target/surefire-reports/TEST-org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver.xml
144: <testcase name="testCliDriver[empty_skip_header_footer_aggr]"
classname="org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver"
time="2.534"/>
split-14/itests/qtest/target/surefire-reports/TEST-org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver.xml
165: <testcase name="testCliDriver[compressed_skip_header_footer_aggr]"
classname="org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver"
time="7.242">
{code}
4. empty_skip_header_footer_aggr gets executed before
compressed_skip_header_footer_aggr (this can be seen above in that 144 is
before 165 in the test xml)
5. Both empty_skip_header_footer_aggr and compressed_skip_header_footer_aggr
create external tables with the data copied to the same location(s).
For example these locations get used in both tests:
${system:test.tmp.dir}/testcase1
${system:test.tmp.dir}/testcase2
Since each test invocation ends up using the same path and the tmp directory is
not cleaned between tests, this is where the conflict occurs.
6. empty_skip_header_footer_aggr includes rmr commands to clean up the
testcase1 and testcase2 directories.
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/ql/src/test/queries/clientpositive/empty_skip_header_footer_aggr.q#L6]
compressed_skip_header_footer_aggr does not:
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/ql/src/test/queries/clientpositive/compressed_skip_header_footer_aggr.q#L1]
This also likely explains why it is not reproducible via:
{code:java}
mvn test -Dtest=TestMiniLlapLocalCliDriver \
  -Dqfile=compressed_skip_header_footer_aggr.q,empty_skip_header_footer_aggr.q
{code}
I think the order of the tests when executed this way is always
compressed_skip_header_footer_aggr.q and then empty_skip_header_footer_aggr.q,
so empty_skip_header_footer_aggr.q never runs first.
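To make the ordering argument concrete, here is a minimal sketch in Python (not Hive code) of two tests sharing one directory. Everything in it is hypothetical: the file names, the helpers run_empty, run_compressed and files_seen, and the assumption that the rmr commands run before the empty test loads its data.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def run_empty(tmp_dir: Path) -> list[str]:
    """Stand-in for empty_skip_header_footer_aggr.q: cleans first, then loads."""
    target = tmp_dir / "testcase1"
    # Plays the role of the rmr commands (assumed to run before the load).
    shutil.rmtree(target, ignore_errors=True)
    target.mkdir(parents=True)
    (target / "empty.txt").write_text("")
    return sorted(p.name for p in target.iterdir())


def run_compressed(tmp_dir: Path) -> list[str]:
    """Stand-in for compressed_skip_header_footer_aggr.q: loads without cleaning."""
    target = tmp_dir / "testcase1"
    target.mkdir(parents=True, exist_ok=True)
    (target / "compressed.txt.gz").write_text("placeholder")
    return sorted(p.name for p in target.iterdir())


def files_seen(order: list[str]) -> dict[str, list[str]]:
    """Run the stand-ins in the given order inside one shared tmp dir.

    Returns, per test, the files its "external table" location contained.
    """
    runners = {"empty": run_empty, "compressed": run_compressed}
    with tempfile.TemporaryDirectory() as tmp:
        return {name: runners[name](Path(tmp)) for name in order}
```

Under these assumptions, files_seen(["compressed", "empty"]) shows each test seeing only its own file, while files_seen(["empty", "compressed"]) shows the compressed stand-in also seeing the stale empty.txt, which is the kind of leftover data that would produce an unexpected diff.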
My fix ends up working because I give each test a unique location for its
external data files.
I'll likely modify empty_skip_header_footer_aggr.q to remove the rmr commands
(because the only thing they really do is hide this problem) and give all the
files/directories unique names. We could likely add a "unique external
directory" variable that is generated per testcase and cleaned up after each
one (or some other solution), but I think that is out of the scope of this
ticket.
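As a rough illustration of the "unique external directory" idea, here is a minimal Python sketch (again not Hive code, and not what the fix in this ticket does). The helper name unique_external_dir and its parameters are hypothetical; the point is only that each testcase gets a directory nobody else uses, and that it is removed when the testcase finishes.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def unique_external_dir(tmp_root: Path, test_name: str):
    """Yield a directory no other test shares, and delete it afterwards."""
    tmp_root.mkdir(parents=True, exist_ok=True)
    # mkdtemp guarantees a fresh, uniquely named directory under tmp_root.
    path = Path(tempfile.mkdtemp(prefix=f"{test_name}_", dir=tmp_root))
    try:
        yield path
    finally:
        # Clean up even if the test body raised.
        shutil.rmtree(path, ignore_errors=True)
```

With something like this, two tests could both create a testcase1 subdirectory without ever seeing each other's files, and no per-test rmr would be needed.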
> compressed_skip_header_footer_aggr.q is flaky
> ---------------------------------------------
>
> Key: HIVE-26584
> URL: https://issues.apache.org/jira/browse/HIVE-26584
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Affects Versions: 4.0.0-alpha-2
> Reporter: John Sherman
> Assignee: John Sherman
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In one of my PRs, compressed_skip_header_footer_aggr.q was failing with an
> unexpected diff, such as:
> {code:java}
> TestMiniLlapLocalCliDriver.testCliDriver:62 Client Execution succeeded but
> contained differences (error code = 1) after executing
> compressed_skip_header_footer_aggr.q
> 69,71c69,70
> < 1 2019-12-31
> < 2 2018-12-31
> < 3 2017-12-31
> ---
> > 2 2019-12-31
> > 3 2019-12-31
> 89d87
> < NULL NULL
> 91c89
> < 2 2018-12-31
> ---
> > 2 2019-12-31
> 100c98
> < 1
> ---
> > 2
> 109c107
> < 1 2019-12-31
> ---
> > 2 2019-12-31
> 127,128c125,126
> < 1 2019-12-31
> < 3 2017-12-31
> ---
> > 2 2019-12-31
> > 3 2019-12-31
> 146a145
> > 2 2019-12-31
> 155c154
> < 1
> ---
> > 2 {code}
> When I investigated, it did not seem to fail when executed locally. Since I
> suspected test interference, I searched for the table names/directories used
> and discovered empty_skip_header_footer_aggr.q, which uses the same table
> names AND external directories.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)