[jira] [Commented] (ORC-412) [C++] ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data

ASF GitHub Bot (JIRA) Wed, 10 Oct 2018 07:28:27 -0700


    [ https://issues.apache.org/jira/browse/ORC-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645040#comment-16645040 ]


ASF GitHub Bot commented on ORC-412:
------------------------------------

majetideepak commented on a change in pull request #317: ORC-412: [C++] Fix 
Char(n) and Varchar(n) writers with UTF-8
URL: https://github.com/apache/orc/pull/317#discussion_r224082450
 
 

 ##########
 File path: c++/src/ColumnWriter.cc
 ##########
 @@ -940,24 +940,98 @@ namespace orc {
     lengthEncoder->recordPosition(rowIndexPosition.get());
   }
 
+  struct Utf8Utils {
+    /**
+     * Counts the number of UTF-8 characters in the input data
+     */
+    static uint64_t charLength(const char * data, uint64_t offset, uint64_t length) {
+      uint64_t chars = 0;
+      for (uint64_t i = 0; i < length; i++) {
+        if (isUtfStartByte(data[offset + i])) {
+          chars++;
+        }
+      }
+      return chars;
+    }
+
+    /**
+     * Return the number of bytes required to read at most
+     * maxCharLength characters in full from a utf-8 encoded byte array provided
+     * by data[offset:offset+length]. This does not validate utf-8 data, but
+     * operates correctly on already valid utf-8 data.
+     *
+     * @param maxCharLength maximum number of characters to keep
+     * @param data the bytes of UTF-8
+     * @param offset the first byte location
+     * @param length the length of data to truncate
+     */
+    static uint64_t truncateBytesTo(uint64_t maxCharLength,
+                                    const char * data,
+                                    uint64_t offset,
+                                    uint64_t length) {
+      uint64_t chars = 0;
+      if (length <= maxCharLength) {
+        return length;
+      }
+      for (uint64_t i = 0; i < length; i++) {
+        if (isUtfStartByte(data[offset + i])) {
+          chars++;
+        }
+        if (chars > maxCharLength) {
 
 Review comment:
   Does this capture the case where the last UTF-8 character is multi-byte?
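
 To make the question concrete, here is a minimal standalone sketch of
 character-aware truncation. It is not the code from the pull request (the
 quoted hunk stops at the `chars > maxCharLength` check), and both
 `isUtfStartByte` as defined here and `truncateToChars` are assumptions written
 for illustration. The idea is to cut only when the start byte of the
 (maxChars + 1)-th character is seen, so that the continuation bytes of a
 trailing multi-byte character are always kept.

```cpp
#include <cstdint>

// Assumed definition: every byte that is not a UTF-8 continuation
// byte (bit pattern 10xxxxxx) starts a new character.
static bool isUtfStartByte(char b) {
  return (static_cast<unsigned char>(b) & 0xC0) != 0x80;
}

// Hypothetical helper: number of leading bytes of data[0:length] that
// hold at most maxChars complete characters. Assumes valid UTF-8 input.
static uint64_t truncateToChars(uint64_t maxChars,
                                const char* data,
                                uint64_t length) {
  uint64_t chars = 0;
  for (uint64_t i = 0; i < length; ++i) {
    if (isUtfStartByte(data[i])) {
      if (chars == maxChars) {
        // data[i] begins the first character that does not fit,
        // so everything before index i is kept.
        return i;
      }
      ++chars;
    }
  }
  // The whole input fits, including a multi-byte final character.
  return length;
}
```

 For the 4-byte input "a\xE2\x82\xAC" (the letter a followed by the 3-byte
 euro sign), a limit of 2 characters keeps all 4 bytes and a limit of 1 keeps
 1 byte. Truncating the same input to 2 bytes, as a byte-based min() would,
 leaves a dangling lead byte 0xE2.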

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [C++] ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
> multi-byte data
> ----------------------------------------------------------------------------------------
>
>                 Key: ORC-412
>                 URL: https://issues.apache.org/jira/browse/ORC-412
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.5.2
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/writer/CharTreeWriter.java#L41
> {code}
>     itemLength = schema.getMaxLength();
>     padding = new byte[itemLength];
>   }
> {code}
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/writer/VarcharTreeWriter.java#L48
> {code}
>       if (vector.noNulls || !vector.isNull[0]) {
>         int itemLength = Math.min(vec.length[0], maxLength);
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
