Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20727#discussion_r172341859
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextOptions.scala ---
@@ -39,9 +39,12 @@ private[text] class TextOptions(@transient private val parameters: CaseInsensiti
*/
val wholeText = parameters.getOrElse(WHOLETEXT, "false").toBoolean
+ val lineSeparator: String = parameters.getOrElse(LINE_SEPARATOR, "\n")
+ require(lineSeparator.nonEmpty, s"'$LINE_SEPARATOR' cannot be an empty string.")
}
private[text] object TextOptions {
val COMPRESSION = "compression"
val WHOLETEXT = "wholetext"
+ val LINE_SEPARATOR = "lineSep"
--- End diff ---
Why is it not "lineSeparator"? I would propose another name for the option:
recordSeparator. Imagine you have the following text file:
```
id: 123
cmd: ls -l
---
id: 456
cmd: rm -rf
```
where the separator is `---`. If the separator is not the newline delimiter, the
records don't look like lines, and recordSeparator would be closer to Hadoop's
terminology. Besides that, we will probably introduce a similar option for other
datasources such as JSON; a recordSeparator for JSON records (not lines) sounds
better from my point of view.
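
For illustration, a minimal sketch of how such an option could be used from the
reader API, assuming it ships under the key `lineSep` as written in this diff
(with `recordSeparator` only the option key would change); the input path is
hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object CustomSeparatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CustomSeparatorExample")
      .master("local[*]")
      .getOrCreate()

    // Each row of the resulting "value" column holds one record delimited by
    // "---" (e.g. the "id: 123 / cmd: ls -l" block), not a single line.
    val records = spark.read
      .option("lineSep", "---")       // option key proposed in this PR
      .text("/path/to/records.txt")   // hypothetical input path

    records.show(truncate = false)
    spark.stop()
  }
}
```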
---