GitHub user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18581#discussion_r135381676

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
    @@ -32,7 +32,9 @@ import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
      * in that file.
      */
     class HadoopFileLinesReader(
    -    file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
    +    file: PartitionedFile,
    +    lineSeparator: Option[String],
    --- End diff --

Thanks for clarifying it. Here is my investigation:

> When the line delimiter is '\n', any of the follow sequences will count as a delimiter: "\n", "\r\n", or "\r"

With this input:

```
a\nb\r\nc\rd
```

Case with `\n`:

```sql
CREATE EXTERNAL TABLE tbl(value STRING)
ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '...';
```

```sql
SELECT value FROM tbl;
```

produced

```
a
b
c
d
```

This looks incorrect. I _guess_ `\n` is not actually being set as the delimiter, so the query falls back to the default behaviour of `LineRecordReader`.

> Accepting a single "\r" is pretty strange, but that's what Hive does so we emulate this behavior.

Case with `\r`:

```sql
CREATE EXTERNAL TABLE tbl(value STRING)
ROW FORMAT DELIMITED LINES TERMINATED BY '\r'
STORED AS TEXTFILE LOCATION '...';
```

produced

```
FAILED: SemanticException 2:41 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''\r''
...
org.apache.hadoop.hive.ql.parse.SemanticException: 2:41 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''\r''
```

If I am not mistaken, this is related to https://issues.apache.org/jira/browse/HIVE-5999 and these lines: https://github.com/apache/hive/blob/696be9f52dfc6fb59c24de19726b4460100fc9ba/ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java#L198-L203

I am curious how this case was tested in the JIRA. If the test used `textinputformat.record.delimiter`, then that is Hadoop's property, which is basically the same thing as what I am doing here.

> Is Hive using Hadoop's LineRecordReader?

In the case above, the input format was `org.apache.hadoop.mapred.TextInputFormat`, which uses `LineRecordReader`.

> How does Hive support it?

It looks like Hive tries to support it through `LINES TERMINATED BY '\r'`, as described in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable (which, per the error above, is currently rejected). I could not find other (formal) ways.
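
To make the difference between the two modes concrete, here is a minimal pure-Scala sketch of how the sample input would split when no separator is set versus when an explicit one is set. It only emulates the behaviour discussed above and is not Hadoop's `LineRecordReader` code; `LineSplitSketch` and its methods are hypothetical names, and the `Option[String]` parameter merely mirrors the `lineSeparator` argument in the diff. It also ignores edge cases such as trailing empty records.

```scala
// Illustrative emulation only; names are hypothetical and this is not
// the actual Hadoop or Spark implementation.
object LineSplitSketch {
  /** Default mode (no separator set): "\r\n", "\r" and "\n" each end a record. */
  def splitDefault(input: String): Seq[String] =
    input.split("\r\n|\r|\n").toSeq

  /** Custom mode: only the exact separator string ends a record. */
  def splitCustom(input: String, separator: String): Seq[String] =
    input.split(java.util.regex.Pattern.quote(separator)).toSeq

  /** Picks the mode from an optional separator, like `lineSeparator` in the diff. */
  def splitLines(input: String, lineSeparator: Option[String]): Seq[String] =
    lineSeparator.map(splitCustom(input, _)).getOrElse(splitDefault(input))
}
```

Under this sketch, `splitLines("a\nb\r\nc\rd", None)` yields the four rows seen in the Hive output above, while `Some("\n")` would keep the `\r` characters inside the records. That difference is why the Hive result suggests the delimiter was never explicitly set.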