[GitHub] [arrow] sagnikc-dremio commented on a change in pull request #7641: ARROW-9328: [C++][Gandiva] Add LTRIM, RTRIM, BTRIM functions for string

GitBox Mon, 20 Jul 2020 22:24:48 -0700


sagnikc-dremio commented on a change in pull request #7641:
URL: https://github.com/apache/arrow/pull/7641#discussion_r457843985




##########
File path: cpp/src/gandiva/precompiled/string_ops.cc
##########
@@ -322,6 +387,138 @@ const char* trim_utf8(gdv_int64 context, const char* data, gdv_int32 data_len,
   return data + start;
 }
 
+// Trims characters present in the trim text from the left end of the base text
+FORCE_INLINE
+const char* ltrim_utf8_utf8(gdv_int64 context, const char* basetext,
+                            gdv_int32 basetext_len, const char* trimtext,
+                            gdv_int32 trimtext_len, int32_t* out_len) {
+  if (basetext_len == 0) {
+    *out_len = 0;
+    return "";
+  } else if (trimtext_len == 0) {
+    *out_len = basetext_len;
+    return basetext;
+  }
+
+  gdv_int32 start_ptr, char_len;
+  // scan the base text from left to right and increment the start pointer till
+  // there is a character which is not present in the trim text
+  for (start_ptr = 0; start_ptr < basetext_len; start_ptr += char_len) {
+    char_len = utf8_char_length(basetext[start_ptr]);
+    if (!is_substr_utf8_utf8(trimtext, trimtext_len, basetext + start_ptr, char_len)) {
+      break;
+    }
+  }
+
+  // the first character from the left is not present in the trim text,
+  // hence there is nothing to be trimmed, return original string
+  if (start_ptr == 0) {
+    *out_len = basetext_len;
+    return basetext;
+  }
+
+  // all the characters in the base text are present in the trim text,
+  // hence trim the entire string, return empty string
+  if (start_ptr == basetext_len) {
+    *out_len = 0;
+    return "";
+  }
+
+  // base text has some characters which are not present in the trim text
+  *out_len = basetext_len - start_ptr;
+  return basetext + start_ptr;
+}
+
+// Trims characters present in the trim text from the right end of the base text
+FORCE_INLINE
+const char* rtrim_utf8_utf8(gdv_int64 context, const char* basetext,
+                            gdv_int32 basetext_len, const char* trimtext,
+                            gdv_int32 trimtext_len, int32_t* out_len) {
+  if (basetext_len == 0) {
+    *out_len = 0;
+    return "";
+  } else if (trimtext_len == 0) {
+    *out_len = basetext_len;
+    return basetext;
+  }
+
+  gdv_int32 end_cnt, char_len, end_ptr = 0;
+  // scan the base text from left to right and increment the end pointer to the current
+  // position when there is a character which is not present in the trim text
+  for (end_cnt = 0; end_cnt < basetext_len; end_cnt += char_len) {

Review comment:
   I did consider that option earlier. It worked well for single-byte
character strings, but I ran into issues with multibyte character strings.

   We need to decode each UTF-8 character and compute its length before we can
move to the next one. When traversing from right to left, we would need to find
the byte position of the last character, check whether that character belongs
to the trimtext character set, then find the byte position of the
second-to-last character, and so on. It can be done, but it looks a little
complex to me, whereas moving from left to right avoids the problem.

   Maybe there is a better way to do it. Please suggest one if you have
anything in mind.
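
   For reference, the right-to-left scan described above could be sketched
roughly as follows. This is not the code in the PR: `prev_char_start`,
`char_in_set`, and `rtrim_len` are hypothetical names, `char_in_set` is only a
stand-in for `is_substr_utf8_utf8` (whose definition is not shown here), and
the input is assumed to be valid UTF-8. The sketch relies on the fact that
UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so stepping
back over them finds the start of the previous character.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
static bool is_continuation_byte(char c) {
  return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: byte offset at which the character ending just
// before `end` starts. Assumes `data` is valid UTF-8.
static int32_t prev_char_start(const char* data, int32_t end) {
  assert(end > 0);
  int32_t pos = end - 1;
  while (pos > 0 && is_continuation_byte(data[pos])) {
    --pos;
  }
  return pos;
}

// Hypothetical stand-in for is_substr_utf8_utf8: true if the bytes of the
// character [ch, ch + ch_len) occur in `set`. Because UTF-8 lead bytes and
// continuation bytes never overlap, a whole character can only match at a
// character boundary.
static bool char_in_set(const char* set, int32_t set_len, const char* ch,
                        int32_t ch_len) {
  for (int32_t i = 0; i + ch_len <= set_len; ++i) {
    if (std::memcmp(set + i, ch, static_cast<size_t>(ch_len)) == 0) {
      return true;
    }
  }
  return false;
}

// Length of `base` after trimming trailing characters found in `trim`,
// scanning from right to left one UTF-8 character at a time.
static int32_t rtrim_len(const char* base, int32_t base_len, const char* trim,
                         int32_t trim_len) {
  int32_t end = base_len;
  while (end > 0) {
    int32_t start = prev_char_start(base, end);
    if (!char_in_set(trim, trim_len, base + start, end - start)) {
      break;
    }
    end = start;
  }
  return end;
}
```

   With this shape the scan stops at the first non-matching character from the
right, so it does not have to walk the whole base text when only a short
suffix is trimmed.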




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
