[ 
https://issues.apache.org/jira/browse/SPARK-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10185:
------------------------------------

    Assignee:     (was: Apache Spark)

> Spark SQL does not handle comma-separated paths on Hadoop FileSystem
> --------------------------------------------------------------------
>
>                 Key: SPARK-10185
>                 URL: https://issues.apache.org/jira/browse/SPARK-10185
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: koert kuipers
>
> Spark SQL uses a Map[String, String] for data source settings. As a
> consequence, the only way to pass in multiple paths (something that Hadoop's
> file input format supports) is to pass in a comma-separated list. For
> example:
> sqlContext.read.format("json").load("dir1,dir2")
> or
> sqlContext.read.format("json").option("path", "dir1,dir2").load()
> However, in this case ResolvedDataSource does not handle the comma-delimited
> paths correctly for a HadoopFsRelationProvider. It treats the comma-delimited
> paths as a single path.
> For example, if I pass in "dir1,dir2" as the path, it will qualify dir1 but
> ignore dir2 (presumably because it simply treats it as part of dir1). If
> globs are involved, it always returns an empty array of paths
> (because the glob with a comma in it doesn't match anything).
> I think it's important to handle commas to pass in multiple paths, since the
> framework does not provide an alternative. In some cases, such as Parquet, the
> code simply bypasses ResolvedDataSource to support multiple paths, but to me
> this is a workaround that should be discouraged.
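
For illustration only, here is a minimal sketch of the kind of splitting the
description asks for. It is not Spark's or Hadoop's actual code: the object
SplitPaths and the method splitPaths are hypothetical names, and the rule of
leaving commas inside glob alternations such as {a,b} untouched is an
assumption about what a fix would need to do.

```scala
object SplitPaths {
  // Hypothetical helper, not part of Spark: split a comma-separated path
  // list into individual paths. Commas nested inside a glob alternation
  // such as {a,b} are kept as part of the path (an assumed requirement).
  def splitPaths(commaSeparated: String): Seq[String] = {
    val paths = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder
    var depth = 0 // current nesting level of '{' ... '}'
    for (ch <- commaSeparated) {
      ch match {
        case '{' =>
          depth += 1
          current.append(ch)
        case '}' =>
          depth = math.max(0, depth - 1)
          current.append(ch)
        case ',' if depth == 0 =>
          // A top-level comma ends the current path; empty entries are dropped.
          if (current.nonEmpty) paths += current.toString
          current.clear()
        case _ =>
          current.append(ch)
      }
    }
    if (current.nonEmpty) paths += current.toString
    paths.toList
  }
}
```

With this sketch, SplitPaths.splitPaths("dir1,dir2") yields the two entries
dir1 and dir2, so each one could then be qualified and glob-expanded on its
own instead of being treated as a single path.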



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
