[ 
https://issues.apache.org/jira/browse/DRILL-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5529:
-------------------------------
    Description: 
Consider the Drill {{NullableVarCharVector}} type. This vector is composed of 
three buffers (also called vectors):

* Is-set (bit) vector: contains 1 if the value is set, 0 if it is null.
* Data vector; effectively a byte array in which each value is packed one after 
another.
*  Offset vector, in which the entry for each row points to the first byte of 
the value in the data vector.

Suppose we have the values "foo", null, "bar". Then, the vectors contain:

{code}
Is-Set: [1 0 1]
Offsets: [0 3 3 6]
Data:  [f o o b a r]
{code}

(Yes, there is one more offset entry than rows.)

Suppose that the code creating the vector writes values for rows 1 and 3, but 
omits 2 (it is null, which is the default). How do we get that required value 
of 3 in the entry for row 2? The answer is that the logic for setting a value 
keeps track of the last write position and "backfills" missing offset values:

{code}
    public void setSafe(int index, ByteBuffer value, int start, int length) {
      if (index > lastSet + 1) {
        fillEmpties(index);
      }
      ...
{code}

So, when we write the value for row 3 ("bar") we back-fill the missing offset 
for row 2. So far so good.

We can now generalize. We must to the same trick any time that we use a vector 
that uses an offset vector. There are three other cases:

* Required variable-width vectors (where a missing value is the same as an 
empty string).
* A repeated fixed-width vector.
* A repeated variable-width vector (which has *two* offset vectors).

The problem is, none of these actually provide the required code. The caller 
must implement its own back-fill logic else the offset vectors become corrupted.

Consider the required {{VarCharVector}}:

{code}
    protected void set(int index, byte[] bytes, int start, int length) {
      assert index >= 0;
      final int currentOffset = offsetVector.getAccessor().get(index);
      offsetVector.getMutator().set(index + 1, currentOffset + length);
      data.setBytes(currentOffset, bytes, start, length);
    }
{code}

As a result of this omission, any client which skips null values will corrupt 
offset vectors. Consider an example: "try", "foo", "", "bar". We omit writing 
record 2 (empty string). Desired result:

{code}
Data: [t r y f o o b a r]
Offsets: [0 3 6 6 9]
{code}

Actual result:

{code}
Data: [t r y f o o b a r]
Offsets: [0 3 6 0 9]
{code}

The result is that we compute the width of field 2 as -6, not 3. The value of 
the empty field is 9, not 0.

A similar issue arrises with repeated vectors. Consider 
{{RepeatedVarCharVector}}:

{code}
    public void addSafe(int index, byte[] bytes, int start, int length) {
      final int nextOffset = offsets.getAccessor().get(index+1);
      values.getMutator().setSafe(nextOffset, bytes, start, length);
      offsets.getMutator().setSafe(index+1, nextOffset+1);
    }
{code}

Consider this example: (\["a", "b"], \[ ], \["d", "e"]).

Expected:
{code}
Array Offset: [0 2 2 4]
Value Offset: [0 1 2 3 4]
Data: [a b d e]
{code}

Actual:
{code}
Array Offset: [0 2 0 4]
Value Offset: [0 1 2 3 4]
Data: [a b d e]
{code}

The entry for the (unwritten) position 2 is missing.

This bug may be the root cause of several other issues found recently. 
(Potentially DRILL-5470 -- need to verify.)

Two resolutions are possible:

* Require that client code write all values, backfilling empty or null values 
as needed.
* Generalize the mutators to back-fill in all cases, not just a nullable var 
char.

A related issue occurs when a reader fails to do a "final fill" at the end of a 
batch (DRILL-5487).

  was:
Consider the Drill {{OptionalVarCharVector}} type. This vector is composed of 
three buffers (also called vectors):

* Is-set (bit) vector: contains 1 if the value is set, 0 if it is null.
* Data vector; effectively a byte array in which each value is packed one after 
another.
*  Offset vector, in which the entry for each row points to the first byte of 
the value in the data vector.

Suppose we have the values "foo", null, "bar". Then, the vectors contain:

{code}
Is-Set: [1 0 1]
Offsets: [0 3 3 6]
Data:  [f o o b a r]
{code}

(Yes, there is one more offset entry than rows.)

Suppose that the code creating the vector writes values for rows 1 and 3, but 
omits 2 (it is null, which is the default). How do we get that required value 
of 3 in the entry for row 2? The answer is that the logic for setting a value 
keeps track of the last write position and "backfills" missing offset values:

{code}
    public void setSafe(int index, ByteBuffer value, int start, int length) {
      if (index > lastSet + 1) {
        fillEmpties(index);
      }
      ...
{code}

So, when we write the value for row 3 ("bar") we back-fill the missing offset 
for row 2. So far so good.

We can now generalize. We must to the same trick any time that we use a vector 
that uses an offset vector. There are three other cases:

* Required variable-width vectors (where a missing value is the same as an 
empty string).
* A repeated fixed-width vector.
* A repeated variable-width vector (which has *two* offset vectors).

The problem is, none of these actually provide the required code. The caller 
must implement its own back-fill logic else the offset vectors become corrupted.

Consider the required {{VarCharVector}}:

{code}
    protected void set(int index, byte[] bytes, int start, int length) {
      assert index >= 0;
      final int currentOffset = offsetVector.getAccessor().get(index);
      offsetVector.getMutator().set(index + 1, currentOffset + length);
      data.setBytes(currentOffset, bytes, start, length);
    }
{code}

As a result of this omission, any client which skips null values will corrupt 
offset vectors. Consider an example: "try", "foo", "", "bar". We omit writing 
record 2 (empty string). Desired result:

{code}
Data: [t r y f o o b a r]
Offsets: [0 3 6 6 9]
{code}

Actual result:

{code}
Data: [t r y f o o b a r]
Offsets: [0 3 6 0 9]
{code}

The result is that we compute the width of field 2 as -6, not 3. The value of 
the empty field is 9, not 0.

A similar issue arrises with repeated vectors. Consider 
{{RepeatedVarCharVector}}:

{code}
    public void addSafe(int index, byte[] bytes, int start, int length) {
      final int nextOffset = offsets.getAccessor().get(index+1);
      values.getMutator().setSafe(nextOffset, bytes, start, length);
      offsets.getMutator().setSafe(index+1, nextOffset+1);
    }
{code}

Consider this example: (\["a", "b"], \[ ], \["d", "e"]).

Expected:
{code}
Array Offset: [0 2 2 4]
Value Offset: [0 1 2 3 4]
Data: [a b d e]
{code}

Actual:
{code}
Array Offset: [0 2 0 4]
Value Offset: [0 1 2 3 4]
Data: [a b d e]
{code}

The entry for the (unwritten) position 2 is missing.

This bug may be the root cause of several other issues found recently. 
(Potentially DRILL-5470 -- need to verify.)

Two resolutions are possible:

* Require that client code write all values, backfilling empty or null values 
as needed.
* Generalize the mutators to back-fill in all cases, not just a nullable var 
char.

A related issue occurs when a reader fails to do a "final fill" at the end of a 
batch (DRILL-5487).


> Fixed width, repeated vector mutators missing "fill empties" logic
> ------------------------------------------------------------------
>
>                 Key: DRILL-5529
>                 URL: https://issues.apache.org/jira/browse/DRILL-5529
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.11.0
>
>
> Consider the Drill {{NullableVarCharVector}} type. This vector is composed of 
> three buffers (also called vectors):
> * Is-set (bit) vector: contains 1 if the value is set, 0 if it is null.
> * Data vector; effectively a byte array in which each value is packed one 
> after another.
> *  Offset vector, in which the entry for each row points to the first byte of 
> the value in the data vector.
> Suppose we have the values "foo", null, "bar". Then, the vectors contain:
> {code}
> Is-Set: [1 0 1]
> Offsets: [0 3 3 6]
> Data:  [f o o b a r]
> {code}
> (Yes, there is one more offset entry than rows.)
> Suppose that the code creating the vector writes values for rows 1 and 3, but 
> omits 2 (it is null, which is the default). How do we get that required value 
> of 3 in the entry for row 2? The answer is that the logic for setting a value 
> keeps track of the last write position and "backfills" missing offset values:
> {code}
>     public void setSafe(int index, ByteBuffer value, int start, int length) {
>       if (index > lastSet + 1) {
>         fillEmpties(index);
>       }
>       ...
> {code}
> So, when we write the value for row 3 ("bar") we back-fill the missing offset 
> for row 2. So far so good.
> We can now generalize. We must to the same trick any time that we use a 
> vector that uses an offset vector. There are three other cases:
> * Required variable-width vectors (where a missing value is the same as an 
> empty string).
> * A repeated fixed-width vector.
> * A repeated variable-width vector (which has *two* offset vectors).
> The problem is, none of these actually provide the required code. The caller 
> must implement its own back-fill logic else the offset vectors become 
> corrupted.
> Consider the required {{VarCharVector}}:
> {code}
>     protected void set(int index, byte[] bytes, int start, int length) {
>       assert index >= 0;
>       final int currentOffset = offsetVector.getAccessor().get(index);
>       offsetVector.getMutator().set(index + 1, currentOffset + length);
>       data.setBytes(currentOffset, bytes, start, length);
>     }
> {code}
> As a result of this omission, any client which skips null values will corrupt 
> offset vectors. Consider an example: "try", "foo", "", "bar". We omit writing 
> record 2 (empty string). Desired result:
> {code}
> Data: [t r y f o o b a r]
> Offsets: [0 3 6 6 9]
> {code}
> Actual result:
> {code}
> Data: [t r y f o o b a r]
> Offsets: [0 3 6 0 9]
> {code}
> The result is that we compute the width of field 2 as -6, not 3. The value of 
> the empty field is 9, not 0.
> A similar issue arrises with repeated vectors. Consider 
> {{RepeatedVarCharVector}}:
> {code}
>     public void addSafe(int index, byte[] bytes, int start, int length) {
>       final int nextOffset = offsets.getAccessor().get(index+1);
>       values.getMutator().setSafe(nextOffset, bytes, start, length);
>       offsets.getMutator().setSafe(index+1, nextOffset+1);
>     }
> {code}
> Consider this example: (\["a", "b"], \[ ], \["d", "e"]).
> Expected:
> {code}
> Array Offset: [0 2 2 4]
> Value Offset: [0 1 2 3 4]
> Data: [a b d e]
> {code}
> Actual:
> {code}
> Array Offset: [0 2 0 4]
> Value Offset: [0 1 2 3 4]
> Data: [a b d e]
> {code}
> The entry for the (unwritten) position 2 is missing.
> This bug may be the root cause of several other issues found recently. 
> (Potentially DRILL-5470 -- need to verify.)
> Two resolutions are possible:
> * Require that client code write all values, backfilling empty or null values 
> as needed.
> * Generalize the mutators to back-fill in all cases, not just a nullable var 
> char.
> A related issue occurs when a reader fails to do a "final fill" at the end of 
> a batch (DRILL-5487).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to