[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...

HyukjinKwon Fri, 25 Aug 2017 21:49:41 -0700

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18581#discussion_r135381676
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
    @@ -32,7 +32,9 @@ import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
      * in that file.
      */
     class HadoopFileLinesReader(
    -    file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
    +    file: PartitionedFile,
    +    lineSeparator: Option[String],
    --- End diff --
    
    Thanks for clarifying it. Here is my investigation:
    
    > When the line delimiter is '\n', any of the follow sequences will count as a delimiter: "\n", "\r\n", or "\r"
    
    With this input:
    
    ```
    a\nb\r\nc\rd
    ```
    
    Case with `\n`:
    
    ```sql
    CREATE EXTERNAL TABLE tbl(value STRING)
    ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
    STORED AS TEXTFILE LOCATION '...';
    ```
    
    ```sql
    SELECT value FROM tbl;
    ```
    
    produced
    
    ```
    a
    b
    c
    d
    ```
    
    This looks incorrect. I _guess_ the `\n` delimiter is not actually being set, so what we see here is just the default behaviour of `LineRecordReader`.
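
To make the difference concrete, here is a minimal Python sketch. It is not Hadoop or Hive code, and the function names `split_default` and `split_strict` are made up for illustration. It contrasts the default newline handling described in the quote above with a reader that honours only an explicit `\n` delimiter, using the same input.

```python
import re

SAMPLE = "a\nb\r\nc\rd"

def split_default(text):
    # Default-style handling: CRLF, a lone CR, and a lone LF all end a line.
    # CRLF is listed first so it is consumed as one terminator, not two.
    return re.split(r"\r\n|\r|\n", text)

def split_strict(text, delimiter):
    # Strict handling: only the exact delimiter string ends a record.
    return text.split(delimiter)

print(split_default(SAMPLE))       # ['a', 'b', 'c', 'd']
print(split_strict(SAMPLE, "\n"))  # ['a', 'b\r', 'c\rd']
```

If `\n` had really been applied as the only delimiter, I would expect three records (with the carriage returns left inside them) rather than the four rows shown above.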
    
    
    > Accepting a single "\r" is pretty strange, but that's what Hive does so we emulate this behavior.
    
    Case with `\r`:
    
    ```sql
    CREATE EXTERNAL TABLE tbl(value STRING)
    ROW FORMAT DELIMITED LINES TERMINATED BY '\r'
    STORED AS TEXTFILE LOCATION '...';
    ```
    
    produced
    
    ```
    FAILED: SemanticException 2:41 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''\r''
    ...
    org.apache.hadoop.hive.ql.parse.SemanticException: 2:41 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''\r''
    ```
    
    This looks related to https://issues.apache.org/jira/browse/HIVE-5999 and, if I am not mistaken, to these lines:
    
    https://github.com/apache/hive/blob/696be9f52dfc6fb59c24de19726b4460100fc9ba/ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java#L198-L203
    
    I am curious how this case was tested in the JIRA. If that test used `textinputformat.record.delimiter`, that is a Hadoop property, and setting it does basically the same thing as what I am doing here.
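
As a rough illustration of what a custom record delimiter means for the sample input, here is a small Python sketch of a chunked reader that splits only on an explicit delimiter. It is an assumption-laden toy, not Hadoop's `LineRecordReader`: `read_records` and `chunk_size` are hypothetical names, and the real reader works on input splits rather than a plain stream.

```python
import io

def read_records(stream, delimiter, chunk_size=4):
    # Yield byte records separated by the exact `delimiter`.
    # The input is read in small chunks, so a delimiter that straddles
    # a chunk boundary is still found once the buffer has grown enough.
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete record currently held in the buffer.
        while True:
            index = buffer.find(delimiter)
            if index < 0:
                break
            yield buffer[:index]
            buffer = buffer[index + len(delimiter):]
    if buffer:
        # Trailing record that is not followed by a delimiter.
        yield buffer

data = b"a\nb\r\nc\rd"
records = list(read_records(io.BytesIO(data), b"\r"))
print(records)  # [b'a\nb', b'\nc', b'd']
```

With `\r` as the only delimiter, the line feeds stay inside the records, which is the behaviour I would expect from an explicitly configured separator.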
    
    
    > Is Hive using Hadoop's LineRecordReader?
    
    In the case above, the input format was `org.apache.hadoop.mapred.TextInputFormat`, which uses `LineRecordReader`.
    
    > How does Hive support it?
    
    It looks like Hive tries to support it via `LINES TERMINATED BY '\r'`:
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
    I could not find any other (formal) way.



