[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

Paul Rogers (JIRA) Mon, 08 May 2017 15:28:32 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001672#comment-16001672
 ]


Paul Rogers commented on DRILL-5486:
------------------------------------

More cascading errors. In {{RepeatedVarCharOutput.finishBatch()}}:

{code}
  public void finishBatch() {
    mutator.setValueCount(batchIndex);
  }
{code}

Here, we are using {{batchIndex}} to be the row count. We must consider the two 
cases: it is right, and it is off-by-one.

For the correct case, let's look at {{BaseRepeatedValueVector.setValueCount()}}:

{code}
    public void setValueCount(int valueCount) {
      // TODO: populate offset end points
      offsets.getMutator().setValueCount(valueCount == 0 ? 0 : valueCount+1);
      final int childValueCount = valueCount == 0 ? 0 : 
offsets.getAccessor().get(valueCount);
      vector.getMutator().setValueCount(childValueCount);
    }
{code}

Consider that the row count is one. The offset vector must contain two entries: 
\[0, x], where x is the length of the first (and only) array. That is the 
offset count is always one more that the row count (except when we have zero 
rows.)

Some background if we have just one row:

* The first entry must always be zero.
* To get the length of the ith row, we do: offset\[i+1] - offset\[i];

So, we need the result \[0, final-offset].

Then, we need to tell the array the number of entries it contains. This is the 
sum of all array entries in the batch. That value is in the (row count)st 
position of the offset vector as mentioned above.

This works because the {{RepeatedVarCharOutput}} updates the (I+1)st postiion 
for each row.

Now, consider the case with the bug mentioned earlier: we update the offset 
vector, but not the row count. Consider this input:

{code}

a,b,c
{code}

The first line is blank, so the {{batchIndex}} is *not* incremented, but the 
offset vector *is* written. At the end:

{code}
batchIndex = 1
offset vector = [0, 0, 3]
{code}

Then, the above method is given a row count of 1, and so:

{code}
count of row offsets: 2
offset values: [0, 0, 3] (but, value of 3 is ignored because count is set to 2)
childValueCount = row offsets[1] = 0
`columns` array item count: 0 (though there are actually three children)
{code}

Then, the next layer sets the row count to 2 (not knowing that the first row 
was skipped.)

So, in the end we have:

* A row count of 2
* A `columns` offset vector of length 2: \[0, 0], meaning 1 row of zero length.
* A `columns` item offset vector of length 0, meaning an empty array (in the 
only row)

This is an unstable situation: 2 rows at the container level, but only one row 
set up in the `columns` vector.

This is very confusing because the code is wrong, so one not only has to 
understand what should happen, but the incorrect behavior that actually happens.


> Compliant text reader tries, but fails, to ignore empty rows
> ------------------------------------------------------------
>
>                 Key: DRILL-5486
>                 URL: https://issues.apache.org/jira/browse/DRILL-5486
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
> headers (which creates a scalar vector per file column), or without headers 
> (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called 
> {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
>     ...
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>     ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the 
> {{fieldIndex}} has its initial value of -1. But, for subsequent records, 
> {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows 
> elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only 
> for the row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
>     recordStart = characterData;
>     ...
>     int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
>     PlatformDependent.putInt(repeatedOffset, newOffset);
>     repeatedOffset += 4;
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the 
> row count is not. Later, this will cause the container to have one less 
> record.
> This would be a problem if the row count ({{batchIndex}}) was used for the 
> batch row count. But, it is not, the next level of code keeps a separate 
> count, so the "skip empty row" logic causes the counts to get out of sync 
> between the {{RepeatedVarCharOutput}} and the next level of reader.
> In sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the 
> first row.
> * That bug is masked because we always set the row offset, even if we ignore 
> the row in the row count.
> * That bug is masked because the incorrect row count is ignored when setting 
> the container row count.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

Reply via email to