SiasDoming opened a new issue, #1973:
URL: https://github.com/apache/orc/issues/1973
I'm migrating from Core-C++ to Core-Java. While reading data of type
`CHAR(n)`, I found that `BytesColumnVector.length` in Java has different
semantics from `StringVectorBatch.length` in C++. In Java, because of the
`CharTreeReader` code quoted below, `length` is the number of bytes after
trailing padding blanks are trimmed, while in C++ `length` is the total number
of bytes including padding blanks. For example, reading the value `'ABC'` from
a `CHAR(10)` column yields a length of `3` in Java but `10` in C++. I'm
wondering why trimmed lengths are preferred in Java.
PS: Maybe either of these implementations is acceptable to you, as long as
the semantics are the same across the APIs of the different programming
languages, but I have to say that the 'redundant' processing in Java did annoy
me. I have to allocate a new byte array and pad the bytes again manually before
further use. The trimmed lengths also prevent me from using a direct memory
copy (although this is still achievable if I'm willing to depend on the
internal implementation).
```Java
public static class CharTreeReader extends StringTreeReader {
  ...
  @Override
  public void nextVector(ColumnVector previousVector,
                         boolean[] isNull,
                         final int batchSize,
                         FilterContext filterContext,
                         ReadPhase readPhase) throws IOException {
    ...
    // TreeReaderFactory.java:2474
    // TreeReaderFactory.java:2483
    // TreeReaderFactory.java:2493
    adjustedDownLen = StringExpr
        .rightTrimAndTruncate(result.vector[i], result.start[i],
                              result.length[i], maxLength);
    if (adjustedDownLen < result.length[i]) {
      result.setRef(i, result.vector[i], result.start[i],
                    adjustedDownLen);
    }
    ...
  }
}
```
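
For reference, the manual re-padding mentioned in the PS can be sketched roughly as follows. This is only an illustration, not ORC code: `CharPaddingSketch` and `padToWidth` are hypothetical names, and it assumes single-byte characters, so that a `CHAR(10)` value occupies exactly 10 bytes once padded (with multi-byte UTF-8 data the byte width would differ).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharPaddingSketch {

    /**
     * Hypothetical helper (not part of ORC): copies `length` bytes of `src`
     * starting at `start` into a new array of `width` bytes and fills the
     * remainder with blanks. Assumes one byte per character.
     */
    public static byte[] padToWidth(byte[] src, int start, int length, int width) {
        if (length > width) {
            throw new IllegalArgumentException("value is longer than the target width");
        }
        byte[] out = new byte[width];
        System.arraycopy(src, start, out, 0, length);
        // Restore the padding blanks that were trimmed on read.
        Arrays.fill(out, length, width, (byte) ' ');
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for one entry of a batch: a shared buffer plus start/length.
        byte[] buffer = "xxABCyy".getBytes(StandardCharsets.US_ASCII);
        byte[] padded = padToWidth(buffer, 2, 3, 10);
        System.out.println(padded.length); // prints 10
        System.out.println("[" + new String(padded, StandardCharsets.US_ASCII) + "]");
    }
}
```

With a real batch, the arguments would presumably be `vector[i]`, `start[i]` and `length[i]` of the `BytesColumnVector`, plus the declared `CHAR(n)` width. This extra allocation and copy per value is the overhead I'd like to avoid.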