ConfX created HIVE-27497:
----------------------------

             Summary: Vector groupby operation produces incorrect result when 
malicious configuration is intentionally set
                 Key: HIVE-27497
                 URL: https://issues.apache.org/jira/browse/HIVE-27497
             Project: Hive
          Issue Type: Bug
          Components: Hive
            Reporter: ConfX
         Attachments: reproduce.sh

h2. What happened:
Hive VectorGroupByOperator produces incorrect groupby results when 
configuration parameter hive.vectorized.groupby.maxentries and 
hive.groupby.mapaggr.checkinterval are set to some bad value, for example, 
negative value. When Hive first in hash mode processing aggregation, the 
malicious configuration can forcibly change processing mode to streaming mode, 
and the cols in hash mode will be flushed out and not aggregate with the cols 
in streaming mode, which makes the groupby result incorrect. 

This issue is very severe because an attacker can control the value of these 2 
parameters to control the malicious behavior; also there should have some value 
checkers for such parameters, at least a negative value checker for these 2 
parameters.

h2. Buggy code:
In hash mode, the vector groupby operator checks hash efficiency at the end of 
each processBatch. 
{code:java}
@Override
public void processBatch(VectorizedRowBatch batch) throws HiveException {
  ...
  if (this instanceof ProcessingModeHashAggregate) {
    // Check if we should turn into streaming mode
    ((ProcessingModeHashAggregate)this).checkHashModeEfficiency();
  }
}
{code}
In checkHashModeEfficiency, there are three if-conditions that control whether 
to switch to streaming mode.
{code:java}
private void checkHashModeEfficiency() throws HiveException {
  if (lastModeCheckRowCount > numRowsCompareHashAggr) {
    ...
    final long inputRecords = sumBatchSize;
    final long outputRecords = numEntriesHashTable + 
numFlushedOutEntriesBeforeFinalFlush;
    final float ratio = (outputRecords) / (inputRecords * 1.0f);
    if (ratio > minReductionHashAggr) {
      if (inputRecords > maxHtEntries) {
        flush(true);
        changeToStreamingMode();
      }
    }
  }
}
{code}
Here numRowsCompareHashAggr is get from hive.groupby.mapaggr.checkinterval; 
minReductionHashAggr is get from hive.vectorized.groupby.maxentries. If these 
two configuration parameters are set to small enough values, e.g. negative 
values, then these 2 if statements are very likely to be always true. In this 
way, the flush will be called and changed to streaming mode.

This mode changing somehow leads to incorrect grouby aggregation.
Take a look at the following example, we have:
| type   |      |     |     |      |      |      |      |
| int    | null | 1   | 1   | null | 2    | 2    | null |
| string | A    | A   | A   | C    | null | null | A    |
| int    | null | 2   | 2   | null | 2    | 2    | null |
| double | 1.0  | 2.0 | 4.0 | 8.0  | 16.0 | 32.0 | 64.0 |

Now if we want to do a sum, the expected result would be:

| type   |     |      |      |      |
| int    | 1   | 2    | null | null |
| string | A   | null | A    | C    |
| int    | 2   | 2    | null | null |
| double | 6.0 | 48.0 | 65.0 | 8.0  |

However, if we intentionally set hive.vectorized.groupby.maxentries to 
-84396285 and hive.groupby.mapaggr.checkinterval to -1152319368, the output 
becomes:

| type   |     |      |     |      |      |      |
| int    | 1   | null | 1   | null | 2    | null |
| string | A   | A    | A   | C    | null | A    |
| int    | 2   | null | 2   | null | 2    | null |
| double | 2.0 | 1.0  | 4.0 | 8.0  | 48.0 | 46.0 |

This is due to the fact that in the time in processBatch, 
checkHashModeEfficiency do flush and change to streaming, and the first two 
cols do not do the sum operation with the later cols. Later cols do the sum 
operation in streaming mode partially correct.

h2. How to reproduce:
The above example is taken from a test in Hive ql module called 
org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt,
 you can follow the steps to reproduce the bug:
(1) Set hive.vectorized.groupby.maxentries to -84396285 and 
hive.groupby.mapaggr.checkinterval to -1152319368
(2) Run test 
org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator#testMultiKeyIntStringInt.

For an easy reproduction, run the reproduce.sh in the attachment.

You should observe the following failure:
{code}
java.lang.AssertionError: [1, A, 2] expected:<6.0> but was:<2.0>                
                                                                                
                                       
        at org.junit.Assert.fail(Assert.java:89)                                
                                                                                
                                       
        at org.junit.Assert.failNotEquals(Assert.java:835)                      
             
        at org.junit.Assert.assertEquals(Assert.java:120)                       
             
        at 
org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$ValueValidator.validate(TestVectorGroupByOperator.java:2773)
        at 
org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator$12.inspectRow(TestVectorGroupByOperator.java:2463)
        at 
org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureOutputOperator.process(FakeCaptureOutputOperator.java:94)
        at 
org.apache.hadoop.hive.ql.exec.vector.util.FakeCaptureVectorToRowOutputOperator.process(FakeCaptureVectorToRowOutputOperator.java:162)
        at 
org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
        at 
org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.flushOutput(VectorGroupByOperator.java:1347)
        at 
org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.closeOp(VectorGroupByOperator.java:1355)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:686)
        at 
org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKey(TestVectorGroupByOperator.java:2479)
        at 
org.apache.hadoop.hive.ql.exec.vector.TestVectorGroupByOperator.testMultiKeyIntStringInt(TestVectorGroupByOperator.java:896)
{code}

We are happy to provide a patch if this issue is confirmed.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to