srowen commented on a change in pull request #26027: [SPARK-24540][SQL] Support
for multiple character delimiter in Spark CSV read
URL: https://github.com/apache/spark/pull/26027#discussion_r332182154
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
##########
@@ -79,4 +82,48 @@ object CSVExprUtils {
throw new IllegalArgumentException(s"Delimiter cannot be more than one
character: $str")
}
}
+
+ /**
+ * Helper method that converts string representation of a character sequence
to actual
+ * delimiter characters. The input is processed in "chunks", and each chunk
is converted
+ * by calling [[CSVExprUtils.toChar()]]. A chunk is either:
+ * <ul>
+ * <li>a backslash followed by another character</li>
+ * <li>a non-backslash character by itself</li>
+ * </ul>
+ * , in that order of precedence. The result of the converting all chunks is
returned as
+ * a [[String]].
+ *
+ * <br/><br/>Examples:
+ * <ul><li>`\t` will result in a single tab character as the separator (same
as before)
+ * </li><li>`|||` will result in a sequence of three pipe characters as the
separator
+ * </li><li>`\\` will result in a single backslash as the separator (same as
before)
+ * </li><li>`\.` will result in an error (since a dot is not a character
that needs escaped)
+ * </li><li>`\\.` will result in a backslash, then dot, as the separator
character sequence
+ * </li><li>`.\t.` will result in a dot, then tab, then dot as the separator
character sequence
+ * </li>
+ * </ul>
+ *
+ * @param str the string representing the sequence of separator characters
+ * @return a [[String]] representing the multi-character delimiter
+ * @throws IllegalArgumentException if any of the individual input chunks
are illegal
+ */
+ def toDelimiterStr(str: String): String = {
+ var idx = 0
+
+ val delimiter = ListBuffer.empty[Char]
+
+ while (idx < str.length()) {
+ // if the current character is a backslash, check it plus the next char
+ // in order to use existing escape logic
+ val readAhead = if (str(idx) == '\\') 2 else 1
Review comment:
I don't think anything but Scala is handling the `\u0000` case. The String
is one character by the time any of this executes. I think you'd find this
_doesn't_ work if you write `"\\u0000"`, which is what you would have to do to
actually encounter the 6-character string `\u0000` here. But then you'd
interpret the delimiter as `u0000`. Same problem as the `"\\"` case.
To your test case below -- unicode unescaping happens before everything, so
`"""\u0000"""` still yields a 1-character delimiter.
I would suggest punting on this right here, but I am kind of concerned about
the '"\\"` case in general.
I would remove the added comment above about backslash escaping because
AFAICT people should just be using the language's string literal syntax for
expressing the chars, and we actually shouldn't further unescape them, but we
can leave that much unchanged here.
We may find this whole loop is unnecessary as a result.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]