[
https://issues.apache.org/jira/browse/HIVE-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denys Kuzmenko updated HIVE-27497:
----------------------------------
Priority: Major (was: Critical)
> Vector groupby operation produces incorrect result when malicious
> configuration is intentionally set
> ----------------------------------------------------------------------------------------------------
>
> Key: HIVE-27497
> URL: https://issues.apache.org/jira/browse/HIVE-27497
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Reporter: ConfX
> Priority: Major
> Attachments: reproduce.sh
>
>
> h2. What happened:
> Hive's VectorGroupByOperator produces incorrect group-by results when the
> configuration parameters hive.vectorized.groupby.maxentries and
> hive.groupby.mapaggr.checkinterval are set to bad values, for example
> negative ones. While Hive is still aggregating in hash mode, such values can
> forcibly switch processing to streaming mode: the groups accumulated in hash
> mode are flushed out and are never aggregated with the rows processed
> afterwards in streaming mode, which makes the group-by result incorrect.
> This issue is severe because an attacker who controls these two parameters
> can trigger the faulty behavior at will. There should also be value
> validation for such parameters, at a minimum a negative-value check for
> these two.
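> As an illustration, a minimal validation sketch could look like the
> following (GroupByConfValidator and validateGroupByConf are hypothetical,
> not existing Hive code; the property names are the two from this report):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public final class GroupByConfValidator {
>   /** Hypothetical guard: reject non-positive values for the two parameters. */
>   public static void validateGroupByConf(Configuration conf) {
>     checkPositive(conf, "hive.vectorized.groupby.maxentries");
>     checkPositive(conf, "hive.groupby.mapaggr.checkinterval");
>   }
>
>   private static void checkPositive(Configuration conf, String key) {
>     int value = conf.getInt(key, 1);
>     if (value <= 0) {
>       throw new IllegalArgumentException(
>           key + " must be positive, but was " + value);
>     }
>   }
> }
> {code}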
> h2. Buggy code:
> In hash mode, the vector group-by operator checks hash-table efficiency at
> the end of each processBatch call.
> {code:java}
> @Override
> public void processBatch(VectorizedRowBatch batch) throws HiveException {
>   ...
>   if (this instanceof ProcessingModeHashAggregate) {
>     // Check if we should turn into streaming mode
>     ((ProcessingModeHashAggregate) this).checkHashModeEfficiency();
>   }
> }
> {code}
> In checkHashModeEfficiency, there are three if-conditions that control
> whether to switch to streaming mode.
> {code:java}
> private void checkHashModeEfficiency() throws HiveException {
>   if (lastModeCheckRowCount > numRowsCompareHashAggr) {
>     ...
>     final long inputRecords = sumBatchSize;
>     final long outputRecords = numEntriesHashTable + numFlushedOutEntriesBeforeFinalFlush;
>     final float ratio = (outputRecords) / (inputRecords * 1.0f);
>     if (ratio > minReductionHashAggr) {
>       if (inputRecords > maxHtEntries) {
>         flush(true);
>         changeToStreamingMode();
>       }
>     }
>   }
> }
> {code}
> Here numRowsCompareHashAggr is read from hive.groupby.mapaggr.checkinterval,
> and maxHtEntries is read from hive.vectorized.groupby.maxentries. If these
> two configuration parameters are set to small enough values, e.g. negative
> ones, the outer condition and the innermost condition are always true, so
> flush is called and the operator switches to streaming mode.
> This mode change is what leads to the incorrect group-by aggregation.
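> To see why, a standalone check with the values from this report (assuming
> the fields simply hold the raw configured values) shows both guards pass for
> every batch:
> {code:java}
> public class AlwaysTrueCheck {
>   public static void main(String[] args) {
>     long numRowsCompareHashAggr = -1152319368L; // hive.groupby.mapaggr.checkinterval
>     long maxHtEntries = -84396285L;             // hive.vectorized.groupby.maxentries
>     long lastModeCheckRowCount = 0;             // any non-negative row count
>     long inputRecords = 1;                      // any positive batch size
>     System.out.println(lastModeCheckRowCount > numRowsCompareHashAggr); // true
>     System.out.println(inputRecords > maxHtEntries);                    // true
>   }
> }
> {code}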
> Consider the following example. Each table row lists one input column's
> values across the seven input rows; the first three columns are the group-by
> keys and the double column is the value being summed:
> | type | row 1 | row 2 | row 3 | row 4 | row 5 | row 6 | row 7 |
> | int (key) | null | 1 | 1 | null | 2 | 2 | null |
> | string (key) | A | A | A | C | null | null | A |
> | int (key) | null | 2 | 2 | null | 2 | 2 | null |
> | double (value) | 1.0 | 2.0 | 4.0 | 8.0 | 16.0 | 32.0 | 64.0 |
> Now if we sum the double column grouped by the three keys, the expected
> result (one table column per output group) would be:
> | type | group 1 | group 2 | group 3 | group 4 |
> | int (key) | 1 | 2 | null | null |
> | string (key) | A | null | A | C |
> | int (key) | 2 | 2 | null | null |
> | double (sum) | 6.0 | 48.0 | 65.0 | 8.0 |
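> As a sanity check, a plain single-pass HashMap aggregation over the seven
> rows (independent of Hive) reproduces exactly these sums:
> {code:java}
> import java.util.Arrays;
> import java.util.LinkedHashMap;
> import java.util.List;
> import java.util.Map;
>
> public class ExpectedSums {
>   public static void main(String[] args) {
>     // One row per input: {key1, key2, key3, value}
>     Object[][] rows = {
>       {null, "A", null, 1.0}, {1, "A", 2, 2.0}, {1, "A", 2, 4.0},
>       {null, "C", null, 8.0}, {2, null, 2, 16.0}, {2, null, 2, 32.0},
>       {null, "A", null, 64.0},
>     };
>     Map<List<Object>, Double> sums = new LinkedHashMap<>();
>     for (Object[] r : rows) {
>       sums.merge(Arrays.asList(r[0], r[1], r[2]), (Double) r[3], Double::sum);
>     }
>     // Prints: {[null, A, null]=65.0, [1, A, 2]=6.0, [null, C, null]=8.0, [2, null, 2]=48.0}
>     System.out.println(sums);
>   }
> }
> {code}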
> However, if we intentionally set hive.vectorized.groupby.maxentries to
> -84396285 and hive.groupby.mapaggr.checkinterval to -1152319368, the output
> becomes:
> | type | group 1 | group 2 | group 3 | group 4 | group 5 | group 6 |
> | int (key) | 1 | null | 1 | null | 2 | null |
> | string (key) | A | A | A | C | null | A |
> | int (key) | 2 | null | 2 | null | 2 | null |
> | double (sum) | 2.0 | 1.0 | 4.0 | 8.0 | 48.0 | 64.0 |
> This happens because checkHashModeEfficiency flushes the hash table and
> switches to streaming mode partway through processBatch: the groups already
> flushed are never combined with the rows aggregated afterwards in streaming
> mode. For example, the group (1, A, 2) is emitted twice, once with the
> flushed partial sum 2.0 and once with the streaming-mode sum 4.0, instead of
> a single row with the correct sum 6.0.
> h2. How to reproduce:
> The above example is taken from the Hive ql module test
> org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt.
> Follow these steps to reproduce the bug:
> (1) Set hive.vectorized.groupby.maxentries to -84396285 and
> hive.groupby.mapaggr.checkinterval to -1152319368
> (2) Run test
> org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt.
> For an easy reproduction, run the reproduce.sh in the attachment.
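> If you prefer to set the values programmatically, a minimal sketch (the
> exact wiring into the test's HiveConf may differ) is:
> {code:java}
> import org.apache.hadoop.hive.conf.HiveConf;
>
> public class BadGroupByConf {
>   public static HiveConf buildBadConf() {
>     HiveConf conf = new HiveConf();
>     // Inject the two bad values from this report before building the operator:
>     conf.setInt("hive.vectorized.groupby.maxentries", -84396285);
>     conf.setInt("hive.groupby.mapaggr.checkinterval", -1152319368);
>     return conf;
>   }
> }
> {code}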
> You should observe the following failure:
> {code}
> java.lang.AssertionError: [1, A, 2] expected:<6.0> but was:<2.0>
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:120)
>     at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$ValueValidator.validate(TestVectorGroupByOperator.java:2773)
>     at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$12.inspectRow(TestVectorGroupByOperator.java:2463)
>     at org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureOutputOperator.process(FakeCaptureOutputOperator.java:94)
>     at org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureVectorToRowOutputOperator.process(FakeCaptureVectorToRowOutputOperator.java:162)
>     at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.flushOutput(VectorGroupByOperator.java:1347)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.closeOp(VectorGroupByOperator.java:1355)
>     at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:686)
>     at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKey(TestVectorGroupByOperator.java:2479)
>     at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKeyIntStringInt(TestVectorGroupByOperator.java:896)
> {code}
> We are happy to provide a patch if this issue is confirmed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)