Hi,
As I pointed out in my previous email, the C++ code has an optimization for
the cases where (i) there are no null values, or (ii) all values are null.
The Java code path does not have it. I am trying to implement this feature. It
would look something like:
    public int isSet(int index) {
      if (nullCount == valueCount) {
        return 0;
      } else if (nullCount == 0) {
        return 1;
      } else {
        final int byteIndex = index >> 3;
        final byte b = validityBuffer.getByte(byteIndex);
        final int bitIndex = index & 7;
        return (b >> bitIndex) & 0x01;
      }
    }
The current problem is that "nullCount" is not explicitly tracked in the
Java code. It is computed on demand by calling
    public int getNullCount() {
      return BitVectorHelper.getNullCount(validityBuffer, valueCount);
    }
which is far from optimal and cannot be called every time in isSet(). I see
in the source code there is a TODO about this:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L75
which says: "Right now BaseValueVector is the top level base class for
other vector types in ValueVector hierarchy (non-nullable) and those
vectors have not yet been refactored/removed so moving things to the top
class as of now is not a good idea."
(1) I am not sure what this means. Can someone explain why it is not a good
idea?
(2) I think there is another branch, AbstractContainerVector, which does
not have BaseValueVector as its top-level base class.
AbstractContainerVector implements ValueVector (which is an interface).
In the C++ code, data and bitmap are both stored in the top-level Array
class, which is probably not possible in the Java implementation. However,
we can move the bitmap operations to the "BaseValueVector" class. I don't
know what to do about the AbstractContainerVector path. Perhaps some code
needs to be duplicated there.
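
To make the idea concrete, here is a minimal sketch of tracking the null
count explicitly instead of recomputing it. It is only an illustration under
stated assumptions: TrackedValidity, setValid() and setNull() are
hypothetical names, a plain byte[] stands in for the validity buffer, and
none of this is the actual Arrow vector API.

```java
// Hypothetical standalone class, not part of Arrow: it only illustrates
// keeping nullCount up to date so isSet() can take the fast paths.
class TrackedValidity {
    private final byte[] validityBuffer; // stand-in for the real validity buffer
    private final int valueCount;
    private int nullCount;

    TrackedValidity(int valueCount) {
        this.valueCount = valueCount;
        this.validityBuffer = new byte[(valueCount + 7) >> 3];
        this.nullCount = valueCount; // all bits start at 0, so every value is null
    }

    // Mark a value as non-null; adjust the count only if the bit actually flips.
    void setValid(int index) {
        final int byteIndex = index >> 3;
        final int mask = 1 << (index & 7);
        if ((validityBuffer[byteIndex] & mask) == 0) {
            validityBuffer[byteIndex] |= mask;
            nullCount--;
        }
    }

    // Mark a value as null; adjust the count only if the bit actually flips.
    void setNull(int index) {
        final int byteIndex = index >> 3;
        final int mask = 1 << (index & 7);
        if ((validityBuffer[byteIndex] & mask) != 0) {
            validityBuffer[byteIndex] &= ~mask;
            nullCount++;
        }
    }

    // Same logic as the isSet() proposed above, using the tracked count.
    int isSet(int index) {
        if (nullCount == valueCount) {
            return 0;
        } else if (nullCount == 0) {
            return 1;
        }
        return (validityBuffer[index >> 3] >> (index & 7)) & 0x01;
    }

    // Constant time: no scan over the validity buffer.
    int getNullCount() {
        return nullCount;
    }
}
```

The catch is that every code path that writes to the validity buffer has to
go through methods like these, otherwise the tracked count drifts from the
bitmap. That is part of why where this logic lives in the class hierarchy
matters.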
(3) Is this the right design choice? Any inputs?
Thanks,
--
Animesh