ConfX created HIVE-27497:
----------------------------
Summary: Vector groupby operation produces incorrect results when a
malicious configuration is intentionally set
Key: HIVE-27497
URL: https://issues.apache.org/jira/browse/HIVE-27497
Project: Hive
Issue Type: Bug
Components: Hive
Reporter: ConfX
Attachments: reproduce.sh
h2. What happened:
Hive's VectorGroupByOperator produces incorrect group-by results when the
configuration parameters hive.vectorized.groupby.maxentries and
hive.groupby.mapaggr.checkinterval are set to bad values, for example
negative ones. While Hive is processing an aggregation in hash mode, such a
configuration can forcibly switch processing to streaming mode; the entries
accumulated in hash mode are then flushed out and never aggregated with the
rows processed in streaming mode, which makes the group-by result incorrect.
This issue is severe because an attacker who controls these two parameters
controls the malicious behavior. There should also be value checkers for such
parameters, at a minimum a negative-value check for these two.
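As a sketch of the kind of guard this report asks for (the method name and wiring are illustrative assumptions, not Hive's actual configuration-validation API):

```java
// Hypothetical bounds check of the kind the report suggests; requirePositive
// and its placement are illustrative, not part of Hive.
public class GroupByConfGuard {
    static int requirePositive(String name, int value) {
        if (value <= 0) {
            throw new IllegalArgumentException(
                name + " must be a positive integer, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // A sane value passes through unchanged.
        System.out.println(requirePositive("hive.vectorized.groupby.maxentries", 1000000));
        // The malicious value from this report would be rejected up front
        // instead of silently forcing the hash -> streaming switch.
        try {
            requirePositive("hive.groupby.mapaggr.checkinterval", -1152319368);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```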
h2. Buggy code:
In hash mode, the vector group-by operator checks hash-table efficiency at
the end of each processBatch call.
{code:java}
@Override
public void processBatch(VectorizedRowBatch batch) throws HiveException {
  ...
  if (this instanceof ProcessingModeHashAggregate) {
    // Check if we should turn into streaming mode
    ((ProcessingModeHashAggregate) this).checkHashModeEfficiency();
  }
}
{code}
In checkHashModeEfficiency, three if-conditions control whether to switch to
streaming mode.
{code:java}
private void checkHashModeEfficiency() throws HiveException {
  if (lastModeCheckRowCount > numRowsCompareHashAggr) {
    ...
    final long inputRecords = sumBatchSize;
    final long outputRecords = numEntriesHashTable + numFlushedOutEntriesBeforeFinalFlush;
    final float ratio = (outputRecords) / (inputRecords * 1.0f);
    if (ratio > minReductionHashAggr) {
      if (inputRecords > maxHtEntries) {
        flush(true);
        changeToStreamingMode();
      }
    }
  }
}
{code}
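To see why negative thresholds defeat these checks, the control flow above can be modeled in isolation (a standalone sketch, not Hive code; 0.5f stands in for a typical minReductionHashAggr value):

```java
// Standalone model (not Hive code) of the three guards in
// checkHashModeEfficiency; all row counts are non-negative in practice.
public class NegativeThresholdDemo {
    static boolean wouldSwitchToStreaming(long lastModeCheckRowCount,
                                          long inputRecords, long outputRecords,
                                          long numRowsCompareHashAggr,
                                          float minReductionHashAggr,
                                          long maxHtEntries) {
        if (lastModeCheckRowCount > numRowsCompareHashAggr) {
            final float ratio = outputRecords / (inputRecords * 1.0f);
            if (ratio > minReductionHashAggr && inputRecords > maxHtEntries) {
                return true; // would flush(true) and changeToStreamingMode()
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // With sane thresholds, a tiny 4-row batch never triggers the switch.
        System.out.println(wouldSwitchToStreaming(4, 4, 3, 100000, 0.5f, 1000000));
        // With the negative values from this report, both row-count guards are
        // trivially true for any non-negative count, so the switch fires as
        // soon as the ratio check passes -- which happens after only a few rows.
        System.out.println(wouldSwitchToStreaming(4, 4, 3, -1152319368L, 0.5f, -84396285L));
    }
}
```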
Here numRowsCompareHashAggr comes from hive.groupby.mapaggr.checkinterval and
maxHtEntries comes from hive.vectorized.groupby.maxentries. If these two
parameters are set to small enough values, e.g. negative values, the
corresponding if statements are always true for non-negative row counts, and
the remaining ratio check passes easily early on, when little reduction has
happened yet. The hash table is then flushed and the operator changes to
streaming mode, and this premature mode change leads to incorrect group-by
aggregation.
Take the following example input, shown transposed: the first column gives
each column's type, and each subsequent column is one input row (three
grouping keys and a double value):
| type | | | | | | | |
| int | null | 1 | 1 | null | 2 | 2 | null |
| string | A | A | A | C | null | null | A |
| int | null | 2 | 2 | null | 2 | 2 | null |
| double | 1.0 | 2.0 | 4.0 | 8.0 | 16.0 | 32.0 | 64.0 |
Summing the double column grouped by the three key columns, the expected
result (again transposed, one group per column) is:
| type | | | | |
| int | 1 | 2 | null | null |
| string | A | null | A | C |
| int | 2 | 2 | null | null |
| double | 6.0 | 48.0 | 65.0 | 8.0 |
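The expected sums can be double-checked with a plain hash aggregation over the same rows (a minimal sketch using java.util collections, outside Hive; note that null key columns take part in grouping):

```java
import java.util.*;

public class ExpectedSumDemo {
    // Aggregate the seven example rows: (int, string, int) key + double value.
    // Arrays.asList tolerates nulls, so null key columns group correctly.
    static Map<List<Object>, Double> aggregate() {
        Object[][] rows = {
            {null, "A", null, 1.0},
            {1,    "A",  2,   2.0},
            {1,    "A",  2,   4.0},
            {null, "C", null, 8.0},
            {2,    null, 2,  16.0},
            {2,    null, 2,  32.0},
            {null, "A", null, 64.0},
        };
        Map<List<Object>, Double> sums = new LinkedHashMap<>();
        for (Object[] r : rows) {
            sums.merge(Arrays.asList(r[0], r[1], r[2]), (Double) r[3], Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Four groups: [null, A, null]=65.0, [1, A, 2]=6.0,
        //              [null, C, null]=8.0, [2, null, 2]=48.0
        System.out.println(aggregate());
    }
}
```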
However, if we intentionally set hive.vectorized.groupby.maxentries to
-84396285 and hive.groupby.mapaggr.checkinterval to -1152319368, the output
becomes:
| type | | | | | | |
| int | 1 | null | 1 | null | 2 | null |
| string | A | A | A | C | null | A |
| int | 2 | null | 2 | null | 2 | null |
| double | 2.0 | 1.0 | 4.0 | 8.0 | 48.0 | 46.0 |
This happens because, partway through processBatch, checkHashModeEfficiency
flushes the hash table and switches to streaming mode: the rows already
aggregated in hash mode are emitted immediately and are never combined with
the matching rows that arrive later, which are then only partially aggregated
in streaming mode.
h2. How to reproduce:
The example above is taken from the test
org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt
in the Hive ql module. Follow these steps to reproduce the bug:
# Set hive.vectorized.groupby.maxentries to -84396285 and hive.groupby.mapaggr.checkinterval to -1152319368.
# Run the test org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt.
For an easier reproduction, run reproduce.sh from the attachment.
You should observe the following failure:
{code}
java.lang.AssertionError: [1, A, 2] expected:<6.0> but was:<2.0>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:120)
	at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$ValueValidator.validate(TestVectorGroupByOperator.java:2773)
	at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$12.inspectRow(TestVectorGroupByOperator.java:2463)
	at org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureOutputOperator.process(FakeCaptureOutputOperator.java:94)
	at org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureVectorToRowOutputOperator.process(FakeCaptureVectorToRowOutputOperator.java:162)
	at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
	at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.flushOutput(VectorGroupByOperator.java:1347)
	at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.closeOp(VectorGroupByOperator.java:1355)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:686)
	at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKey(TestVectorGroupByOperator.java:2479)
	at org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKeyIntStringInt(TestVectorGroupByOperator.java:896)
{code}
We are happy to provide a patch if this issue is confirmed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)