[jira] [Updated] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor

Paul Rogers (JIRA) Mon, 08 May 2017 18:35:50 -0700

     [ 
https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Rogers updated DRILL-5489:
-------------------------------
    Description: 
Suppose a user runs a query of form:
{code}
SELECT columns[70000] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent of making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
        for(Integer i : columnIds){
          ...
          fields[i] = true;
        }
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K. It may be that the code further up the hierarchy does 
the checks. But, if so, it should do the other checks as well. Leaving the 
checks incomplete is confusing.

While we are at it, we might as well fix another bit of bogus code:

{code}
        for(Integer i : columnIds){
          maxField = 0;
          maxField = Math.max(maxField, i);
          ...
        }
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

This will muck up the text reader. The text reader "shortcuts" the rest of the 
line if it need not read any more fields. In {{TextReader.parseRecord()}}:

{code}
        earlyTerm = ! parseField();
        ...
        if (earlyTerm) {
          if (ch != newLine) {
            input.skipLines(1);
          }
          break;
        }
{code}

And:

{code}
  private final boolean parseField() throws IOException {
      ...
      return output.endField();
{code}

And, in {{RepeatedVarCharOutput.endField()}}:

{code}
  public boolean endField() {
    ...
    return fieldIndex < maxField;
  }
{code}

The only reason that this does not cause data loss is the insertion of an 
unnecessary sort earlier in the constructor:

{code}
        Collections.sort(columnIds);
      ...
        for (Integer i : columnIds){
{code}

We can remove the sort and compute the maxField correct and get the same result 
with simpler code.

  was:
Suppose a user runs a query of form:
{code}
SELECT columns[70000] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent of making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
        for(Integer i : columnIds){
          ...
          fields[i] = true;
        }
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K. It may be that the code further up the hierarchy does 
the checks. But, if so, it should do the other checks as well. Leaving the 
checks incomplete is confusing.

While we are at it, we might as well fix another bit of bogus code:

{code}
        for(Integer i : columnIds){
          maxField = 0;
          maxField = Math.max(maxField, i);
          ...
        }
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

This will muck up the text reader. The text reader "shortcuts" the rest of the 
line if it need not read any more fields. In {{TextReader.parseRecord()}}:

{code}
        earlyTerm = ! parseField();
        ...
        if (earlyTerm) {
          if (ch != newLine) {
            input.skipLines(1);
          }
          break;
        }
{code}

And:

{code}
  private final boolean parseField() throws IOException {
      ...
      return output.endField();
{code}

And, in {{RepeatedVarCharOutput.endField()}}:

{code}
  public boolean endField() {
    ...
    return fieldIndex < maxField;
  }
{code}



> Unprotected array access in RepeatedVarCharOutput ctor
> ------------------------------------------------------
>
>                 Key: DRILL-5489
>                 URL: https://issues.apache.org/jira/browse/DRILL-5489
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Suppose a user runs a query of form:
> {code}
> SELECT columns[70000] FROM `dfs`.`mycsv.csv`
> {code}
> Internally, this will create a {{PathSegment}} to represent the selected 
> column. This is passed into the {{RepeatedVarCharOutput}} constructor where 
> it is used to set a flag in an array of 64K booleans. But, while the code is 
> very diligent of making sure that the column name is "columns" and that the 
> path segment is an array, it does not check the array value. Instead:
> {code}
>         for(Integer i : columnIds){
>           ...
>           fields[i] = true;
>         }
> {code}
> We need to add a bounds check to reject array indexes that are not valid: 
> negative or above 64K. It may be that the code further up the hierarchy does 
> the checks. But, if so, it should do the other checks as well. Leaving the 
> checks incomplete is confusing.
> While we are at it, we might as well fix another bit of bogus code:
> {code}
>         for(Integer i : columnIds){
>           maxField = 0;
>           maxField = Math.max(maxField, i);
>           ...
>         }
> {code}
> The code to compute maxField simply uses the last value, not the maximum 
> value. This will be thrown off by a query of the form:
> {code}
> SELECT columns[20], columns[1] FROM ...
> {code}
> This will muck up the text reader. The text reader "shortcuts" the rest of 
> the line if it need not read any more fields. In {{TextReader.parseRecord()}}:
> {code}
>         earlyTerm = ! parseField();
>         ...
>         if (earlyTerm) {
>           if (ch != newLine) {
>             input.skipLines(1);
>           }
>           break;
>         }
> {code}
> And:
> {code}
>   private final boolean parseField() throws IOException {
>       ...
>       return output.endField();
> {code}
> And, in {{RepeatedVarCharOutput.endField()}}:
> {code}
>   public boolean endField() {
>     ...
>     return fieldIndex < maxField;
>   }
> {code}
> The only reason that this does not cause data loss is the insertion of an 
> unnecessary sort earlier in the constructor:
> {code}
>         Collections.sort(columnIds);
>       ...
>         for (Integer i : columnIds){
> {code}
> We can remove the sort and compute the maxField correct and get the same 
> result with simpler code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor

Reply via email to