koert kuipers created SPARK-10185:
-------------------------------------
Summary: Spark SQL does not handle comma separated paths on Hadoop
FileSystem
Key: SPARK-10185
URL: https://issues.apache.org/jira/browse/SPARK-10185
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.1
Reporter: koert kuipers
Spark SQL uses a Map[String, String] for data source settings. As a consequence,
the only way to pass in multiple paths (something that Hadoop file input formats
support) is to pass in a comma separated list. For example:
sqlContext.read.format("json").load("dir1,dir2")
or
sqlContext.read.format("json").option("path", "dir1,dir2").load()
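A self-contained way to reproduce this would be something like the sketch below
(names such as CommaPathRepro and the local master are my own placeholders;
dir1 and dir2 are assumed to be directories of JSON files):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CommaPathRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("comma-path-repro").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // The options map allows only one value per key, so both directories
    // have to be passed as a single comma separated string.
    val viaLoad = sqlContext.read.format("json").load("dir1,dir2")
    val viaOption = sqlContext.read.format("json").option("path", "dir1,dir2").load()

    println(viaLoad.count())
    println(viaOption.count())
  }
}
{code}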
However, in this case ResolvedDataSource does not handle the comma delimited
paths correctly for a HadoopFsRelationProvider: it treats the multiple comma
delimited paths as a single path.
For example, if I pass in "dir1,dir2" for the path, it will make dir1 qualified
but ignore dir2 (presumably because it simply treats it as part of dir1). If
globs are involved, it always returns an empty array of paths (because the glob
with a comma in it doesn't match anything).
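For what it's worth, one possible direction would be to split the path option on
commas and qualify each entry against its own FileSystem, similar to what Hadoop's
FileInputFormat does. A rough sketch only, with illustrative names; this is not
the actual ResolvedDataSource code:
{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SQLContext

// Split the single "path" option on commas and qualify each entry against
// its own FileSystem, instead of qualifying the whole comma separated
// string as if it were one path. Globs would still need to be expanded
// per entry after this step.
def qualifyCommaSeparatedPaths(sqlContext: SQLContext, pathOption: String): Array[String] = {
  val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
  pathOption.split(",").map(_.trim).filter(_.nonEmpty).map { p =>
    val hdfsPath = new Path(p)
    val fs = hdfsPath.getFileSystem(hadoopConf)
    fs.makeQualified(hdfsPath).toString
  }
}
{code}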
I think it's important to handle commas as a way of passing in multiple paths,
since the framework does not provide an alternative. In some cases, like parquet,
the code simply bypasses ResolvedDataSource to support multiple paths, but to me
this is a workaround that should be discouraged.
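For comparison, the parquet shortcut mentioned above looks roughly like this from
the user side (assuming the varargs DataFrameReader.parquet overload in 1.4.x):
{code:scala}
// Parquet accepts multiple paths directly and so never goes through the
// single comma separated "path" option.
val df = sqlContext.read.parquet("dir1", "dir2")
{code}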