GitHub user kevinyu98 opened a pull request:

    https://github.com/apache/spark/pull/21285

    [SPARK-24176][SQL] LOAD DATA can't identify wildcard in the hdfs file path 

    ## What changes were proposed in this pull request?
    
    When wildcard characters (such as "?") appear in the LOAD DATA command's 
path name, the path-related APIs (Hadoop `Path` and `URI`) do not parse it 
correctly. For example:
    `val srcPath = new Path(hdfsUri)` in `tables.scala` returns the wrong 
result in the following cases:
    - `hdfsUri` = `file:/user/testdemo1/user1/t??eddata60.txt`, 
       `srcPath` = `file:/user/testdemo1/user1/t`
    - `hdfsUri` = `file:/user/testdemo1/user1/?eddata60.txt`, 
       `srcPath` = `file:/user/testdemo1/user1/`
    (the same problem occurs at `val uriPath = uri.getPath()`)
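
The truncation can be reproduced with `java.net.URI` alone, which treats the first "?" as the start of the query component. The following is a minimal sketch, not code from the patch; the object name `WildcardPathDemo` and the helper `parsedPath` are hypothetical, and it assumes the `URI` in question is `java.net.URI`.

```scala
import java.net.URI

// Hypothetical demo object; not part of the Spark code base.
object WildcardPathDemo {
  // Returns the path component exactly as java.net.URI parses it.
  def parsedPath(s: String): String = new URI(s).getPath

  // Returns the query component, i.e. everything after the first "?".
  def parsedQuery(s: String): String = new URI(s).getQuery

  def main(args: Array[String]): Unit = {
    val a = "file:/user/testdemo1/user1/t??eddata60.txt"
    val b = "file:/user/testdemo1/user1/?eddata60.txt"

    // The file name is cut at the first "?".
    println(parsedPath(a))  // /user/testdemo1/user1/t
    println(parsedQuery(a)) // ?eddata60.txt
    println(parsedPath(b))  // /user/testdemo1/user1/
    println(parsedQuery(b)) // eddata60.txt
  }
}
```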
    
    LOAD DATA LOCAL works because the local case calls the utility 
`Utils.resolveURI`, which replaces "?" with "%3F", so the Path API does not 
truncate the file name.
    
    This fix uses the `Utils.resolveURI` method for both local and non-local cases.
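
To illustrate why percent-encoding "?" avoids the truncation, here is a rough sketch using only `java.net.URI`. It is not the implementation of `Utils.resolveURI`; as a stand-in it uses the multi-argument `URI` constructor, which quotes characters that are illegal in a path. The object name `EncodedPathDemo` and the helper `encodePath` are hypothetical.

```scala
import java.net.URI

// Hypothetical demo object; a stand-in for the encoding step, not Spark's Utils.resolveURI.
object EncodedPathDemo {
  // Builds a URI from separate components so that "?" inside the path
  // is quoted as "%3F" instead of being parsed as the query delimiter.
  def encodePath(scheme: String, path: String): URI =
    new URI(scheme, null, path, null, null)

  def main(args: Array[String]): Unit = {
    val uri = encodePath("file", "/user/testdemo1/user1/t??eddata60.txt")

    println(uri.toString) // file:/user/testdemo1/user1/t%3F%3Feddata60.txt
    println(uri.getPath)  // /user/testdemo1/user1/t??eddata60.txt (decoded, not truncated)
    println(uri.getQuery) // null, because no query component was parsed
  }
}
```

With the "?" encoded, the full file name survives in the path component, which matches the behavior the description attributes to the local case.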
    
    I ran a similar test on Hive, and Hive appears to have the same behavior.
    
    `hive> load data inpath 'hdfs:/tmp/?evin.txt' into table foo1;
    FAILED: SemanticException Line 1:17 Invalid path ''hdfs:/tmp/?evin.txt'': 
No files matching path hdfs://stcindia-node-6.fyre.ibm.com:8020/tmp/%3Fevin.txt
    hive> load data inpath 'hdfs:/tmp/k?evin.txt' into table foo1;
    FAILED: SemanticException Line 1:17 Invalid path ''hdfs:/tmp/k?evin.txt'': 
No files matching path hdfs://stcindia-node-6.fyre.ibm.com:8020/tmp/k%3Fevin.txt
    hive> 
    `
    ## How was this patch tested?
    Ran the unit tests locally and added new test cases.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kevinyu98/spark spark-24176

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21285.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21285
    
----
commit 3c1a1cf9fbf23fe9c6a0c32090558dc8d7156871
Author: Kevin Yu <qyu@...>
Date:   2018-05-09T18:53:31Z

    resolve the path string for load data before using it

----


---

