GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/20715

    [SPARK-23434][SQL][BRANCH-2.2] Spark should not warn `metadata directory` for a HDFS file path

    ## What changes were proposed in this pull request?
    
    In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), it logs a misleading warning while looking up `people.json/_spark_metadata`. The root cause is a behavioral difference between `LocalFileSystem` and `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false` for a missing path, whereas `DistributedFileSystem.exists()` raises `org.apache.hadoop.security.AccessControlException`.
    
    ```scala
    scala> spark.version
    res0: String = 2.4.0-SNAPSHOT
    
    scala> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+
    
    scala> spark.read.json("hdfs:///tmp/people.json")
    18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
    18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
    ```
    
    After this PR,
    ```scala
    scala> spark.read.json("hdfs:///tmp/people.json").show
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+
    ```
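    
    The fix pattern described above can be illustrated with a minimal, self-contained sketch. This is an illustration only, not the actual Spark patch; the stand-in functions below are hypothetical substitutes for the two Hadoop filesystems. The idea is that any exception raised while probing `_spark_metadata` is treated the same as the path not existing, so no warning is logged for a plain file path.
    
    ```scala
    // Stand-in for org.apache.hadoop.security.AccessControlException (hypothetical):
    class AccessControlException(msg: String) extends Exception(msg)
    
    // Stand-ins for the two behaviors described above (hypothetical):
    // LocalFileSystem.exists() simply returns false for a missing path.
    def localExists(path: String): Boolean = false
    // DistributedFileSystem.exists() raises AccessControlException on a
    // kerberized cluster when the metadata path cannot be inspected.
    def distributedExists(path: String): Boolean =
      throw new AccessControlException("Permission denied: " + path)
    
    // Guarded check: a failed lookup of `<path>/_spark_metadata` means
    // "no streaming metadata here", not an error worth warning about.
    def hasMetadata(exists: String => Boolean, path: String): Boolean =
      try exists(path + "/_spark_metadata")
      catch { case _: Exception => false }
    
    // Both filesystems now yield the same quiet answer:
    assert(!hasMetadata(localExists, "file:///tmp/people.json"))
    assert(!hasMetadata(distributedExists, "hdfs:///tmp/people.json"))
    ```
    
    With this guard, the batch read path no longer surfaces the spurious `FileStreamSink` warning when the input is an ordinary file rather than a streaming sink directory.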
    
    ## How was this patch tested?
    
    Manual.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-23434-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20715.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20715
    
----
commit 314fae2d36a3b0916fd6e04713a923f1a6f203c2
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-02-21T00:02:44Z

    [SPARK-23434][SQL] Spark should not warn `metadata directory` for a HDFS file path
    
    ## What changes were proposed in this pull request?
    
    In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), it logs a misleading warning while looking up `people.json/_spark_metadata`. The root cause is a behavioral difference between `LocalFileSystem` and `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false` for a missing path, whereas `DistributedFileSystem.exists()` raises `org.apache.hadoop.security.AccessControlException`.
    
    ```scala
    scala> spark.version
    res0: String = 2.4.0-SNAPSHOT
    
    scala> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+
    
    scala> spark.read.json("hdfs:///tmp/people.json")
    18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
    18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
    ```
    
    After this PR,
    ```scala
    scala> spark.read.json("hdfs:///tmp/people.json").show
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #20616 from dongjoon-hyun/SPARK-23434.

----
