Re: [PR] [SPARK-46488][SQL] Skipping trimAll call during timestamp parsing [spark]

via GitHub Mon, 25 Dec 2023 02:19:45 -0800


MaxGekk commented on code in PR #44463:
URL: https://github.com/apache/spark/pull/44463#discussion_r1436048800



##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##########
@@ -619,6 +616,29 @@ trait SparkDateTimeUtils {
       case NonFatal(_) => None
     }
   }
+
+  /**
+   * This method retrieves the start and end indices of a byte array after trimming
+   * any whitespace or ISO control characters.
+   * This way we can avoid allocating a new string with trimAll method
+   * and just operate between the trimmed indices.
+   *
+   * @param bytes The byte array to be trimmed.
+   * @return A tuple of two integers; first being the start and second the end trimmed index.
+   */
+  private def getTrimmedStartEnd(bytes: Array[Byte]): (Int, Int) = {
+    var (start, end) = (0, bytes.length - 1)
+
+    while (start < bytes.length && UTF8String.isWhitespaceOrISOControl(bytes(start))) {
+      start += 1
+    }
+
+    while (end > start && UTF8String.isWhitespaceOrISOControl(bytes(end))) {
+      end -= 1
+    }
+
+    (start, end + 1)

Review Comment:
   Doesn't this create a `Tuple` instance? Is it possible to avoid that? For
example, define two separate `inline` functions:
   - `getTrimmedStart(bytes: Array[Byte]): Int`
   - `getTrimmedEnd(bytes: Array[Byte], start: Int): Int`
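
A minimal sketch of what such a split might look like, so that no tuple is allocated per call. The enclosing object `TrimIndices` and the local `isWhitespaceOrISOControl` helper are hypothetical stand-ins so the snippet compiles on its own; in the PR the check would go through `UTF8String.isWhitespaceOrISOControl`. Scala 2's `@inline` annotation is assumed here, and it is only a hint to the optimizer.

```scala
object TrimIndices {
  // Hypothetical stand-in for UTF8String.isWhitespaceOrISOControl from the diff.
  // Assumption: a byte is trimmable when it is whitespace or an ISO control
  // character. Negative bytes (non-ASCII UTF-8 bytes) widen to negative ints,
  // for which both checks return false.
  @inline private def isWhitespaceOrISOControl(b: Byte): Boolean =
    Character.isWhitespace(b.toInt) || Character.isISOControl(b.toInt)

  /** Index of the first non-trimmable byte, or bytes.length if there is none. */
  @inline def getTrimmedStart(bytes: Array[Byte]): Int = {
    var start = 0
    while (start < bytes.length && isWhitespaceOrISOControl(bytes(start))) {
      start += 1
    }
    start
  }

  /** Exclusive end index of the trimmed range, given the already computed start. */
  @inline def getTrimmedEnd(bytes: Array[Byte], start: Int): Int = {
    var end = bytes.length - 1
    while (end > start && isWhitespaceOrISOControl(bytes(end))) {
      end -= 1
    }
    end + 1
  }
}
```

The caller would then keep both results in local `Int` variables, for example `val start = TrimIndices.getTrimmedStart(bytes)` followed by `val end = TrimIndices.getTrimmedEnd(bytes, start)`, and operate on `bytes` between `start` (inclusive) and `end` (exclusive), as the original `getTrimmedStartEnd` intends.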



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


