[I] [Java] Different semantic of lengths for CHAR(n) with C++ [orc]

via GitHub Thu, 11 Jul 2024 01:39:19 -0700


SiasDoming opened a new issue, #1973:
URL: https://github.com/apache/orc/issues/1973


   I'm migrating from Core-C++ to Core-Java. While reading data of type 
`CHAR(n)`, I found that `BytesColumnVector.length` in Java has different 
semantics from `StringVectorBatch.length` in C++. In Java, because of the 
following code, it is the number of bytes after the padding blanks are trimmed, 
whereas `length` in C++ is the total number of bytes including padding 
blanks. For example, reading the value `'ABC'` of type `CHAR(10)` yields a 
length of `3` in Java but `10` in C++. I'm wondering why trimmed lengths are 
preferred in Java.
   PS: Either implementation may be acceptable to you, as long as the 
semantics are the same across the APIs of the different programming languages, 
but I have to say that the 'redundant' processing in Java did annoy me. I have 
to allocate a new byte array and pad the bytes again manually for further use. 
The trimmed lengths also prevent me from using a direct memory copy (although 
this is still achievable if I'm willing to depend on the internal 
implementation).
   
   ```Java
     public static class CharTreeReader extends StringTreeReader {
     ...
       @Override
       public void nextVector(ColumnVector previousVector,
                              boolean[] isNull,
                              final int batchSize,
                              FilterContext filterContext,
                              ReadPhase readPhase) throws IOException {
         ...
           // TreeReaderFactory.java:2474
           // TreeReaderFactory.java:2483
           // TreeReaderFactory.java:2493
           adjustedDownLen = StringExpr
               .rightTrimAndTruncate(result.vector[i], result.start[i],
                                     result.length[i], maxLength);
           if (adjustedDownLen < result.length[i]) {
             result.setRef(i, result.vector[i], result.start[i],
                           adjustedDownLen);
           }
         ...
       }
     }
   ```
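
   To make the manual workaround concrete, here is a minimal sketch of the re-padding step. It is not ORC code: the class `CharPadding` and its methods `trimmedLength` and `padToWidth` are hypothetical names used only for illustration, and it assumes single-byte (ASCII) characters, so the `CHAR(n)` width in characters equals the width in bytes. The `src`, `start` and `length` parameters stand in for one entry of `BytesColumnVector.vector`, `start` and `length`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of ORC. Assumes single-byte characters.
final class CharPadding {

    // Number of bytes left once trailing blanks are dropped; this mimics the
    // trimmed length described above for the Java reader.
    static int trimmedLength(byte[] src, int start, int length) {
        int end = start + length;
        while (end > start && src[end - 1] == ' ') {
            end--;
        }
        return end - start;
    }

    // Copies the trimmed bytes into a new array of `width` bytes and fills the
    // remainder with blanks, restoring the padded form described for C++.
    static byte[] padToWidth(byte[] src, int start, int length, int width) {
        byte[] out = new byte[width];
        int n = Math.min(length, width);
        System.arraycopy(src, start, out, 0, n);
        Arrays.fill(out, n, width, (byte) ' ');
        return out;
    }

    public static void main(String[] args) {
        // 'ABC' stored as CHAR(10): three letters followed by seven blanks.
        byte[] stored = "ABC       ".getBytes(StandardCharsets.US_ASCII);
        int trimmed = trimmedLength(stored, 0, stored.length);
        byte[] padded = padToWidth(stored, 0, trimmed, 10);
        System.out.println(trimmed + " -> " + padded.length);
    }
}
```

   The extra allocation and copy in `padToWidth` is the per-value overhead complained about above; a reader that kept the padded length would allow a plain bulk copy instead.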


