Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/18581#discussion_r135381676
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala
---
@@ -32,7 +32,9 @@ import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
* in that file.
*/
class HadoopFileLinesReader(
- file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
+ file: PartitionedFile,
+ lineSeparator: Option[String],
--- End diff --
Thanks for clarifying it. Here is my investigation:
> When the line delimiter is '\n', any of the follow sequences will count
as a delimiter: "\n", "\r\n", or "\r"
With this input:
```
a\nb\r\nc\rd
```
Case with `\n`:
```sql
CREATE EXTERNAL TABLE tbl(value STRING)
ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '...';
```
```sql
SELECT value FROM tbl;
```
produced
```
a
b
c
d
```
This looks incorrect. I _guess_ `\n` is not actually being set as the
delimiter, so this is just the default behaviour of `LineRecordReader`.
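To make the difference concrete, here is a minimal sketch in plain Scala (no Hadoop dependency) of the two modes as I understand them: with no delimiter configured, `\r\n`, `\r` and `\n` each end a record; with an explicit delimiter, only that exact string does. `LineSplitSketch` and `splitRecords` are hypothetical names for illustration only, not Spark or Hadoop APIs, and edge cases such as a trailing delimiter are not modelled.

```scala
import java.util.regex.Pattern

// Hypothetical helper for illustration; not part of Spark or Hadoop.
object LineSplitSketch {
  // None      => emulate the default mode: "\r\n", "\r" and "\n" each end a record.
  // Some(sep) => split only on the exact separator string `sep`.
  def splitRecords(input: String, lineSeparator: Option[String]): Seq[String] = {
    val pattern = lineSeparator match {
      case Some(sep) => Pattern.quote(sep) // treat the separator literally, not as a regex
      case None      => "\r\n|\r|\n"       // try "\r\n" first so it counts as one terminator
    }
    // limit = -1 keeps trailing empty strings instead of silently dropping them
    input.split(pattern, -1).toSeq
  }

  def main(args: Array[String]): Unit = {
    val input = "a\nb\r\nc\rd"
    // Escape "\r" before printing so the records stay readable on a terminal.
    def show(records: Seq[String]): String =
      records.map(_.replace("\r", "\\r")).mkString("[", ", ", "]")
    println(show(splitRecords(input, None)))
    println(show(splitRecords(input, Some("\n"))))
  }
}
```

For the input above, the default mode yields the four records `a`, `b`, `c`, `d`, which matches the output here, whereas splitting strictly on `\n` would yield three records: `a`, `b\r` and `c\rd`.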
> Accepting a single "\r" is pretty strange, but that's what Hive does so
we emulate this behavior.
Case with `\r`:
```sql
CREATE EXTERNAL TABLE tbl(value STRING)
ROW FORMAT DELIMITED LINES TERMINATED BY '\r'
STORED AS TEXTFILE LOCATION '...';
```
produced
```
FAILED: SemanticException 2:41 LINES TERMINATED BY only supports newline
'\n' right now. Error encountered near token ''\r''
...
org.apache.hadoop.hive.ql.parse.SemanticException: 2:41 LINES TERMINATED BY
only supports newline '\n' right now. Error encountered near token ''\r''
```
If I am not mistaken, this looks related to
https://issues.apache.org/jira/browse/HIVE-5999 and to these lines:
https://github.com/apache/hive/blob/696be9f52dfc6fb59c24de19726b4460100fc9ba/ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java#L198-L203
I am curious how this case was tested in the JIRA. If that test used
`textinputformat.record.delimiter`, that is a Hadoop property, which is
basically the same thing as what I am doing here.
> Is Hive using Hadoop's LineRecordReader?
In the case above, the input format was
`org.apache.hadoop.mapred.TextInputFormat`, which uses `LineRecordReader`.
> How does Hive support it?
It looks like Hive tries to support it with `LINES TERMINATED BY '\r'`:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
I could not find any other (formal) way.