[ 
https://issues.apache.org/jira/browse/SPARK-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949931#comment-14949931
 ] 

koert kuipers commented on SPARK-10185:
---------------------------------------

[~marmbrus] made it clear that the goal is to support commas as part of file 
paths, and therefore supporting multiple paths will have to be done in a 
different way. he said he was in favor of supporting multiple paths for all 
HadoopFSRelations under these conditions:
* We must keep source/binary compatibility.
* We should give good errors when the source does not support this feature.
* For consistency, I'd prefer if we can just add a load(path: String*) (but I'm 
not sure if this is possible given the above). paths(path: *) is okay, but I 
think I'd prefer if it was not the terminal operator.
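
A minimal Scala sketch of how these three conditions could fit together. The names SketchReader, Relation and multiPathFormats are hypothetical and used only for illustration; this is not Spark's actual API and not the contents of the pull request.

```scala
// Hypothetical stand-ins for illustration; NOT Spark's DataFrameReader or HadoopFsRelation.
final case class Relation(format: String, paths: List[String])

class SketchReader(format: String, multiPathFormats: Set[String]) {

  // The existing single-path method keeps its signature, so code compiled
  // against it continues to link (source/binary compatibility).
  // A comma in `path` is treated as part of the path, not as a separator.
  def load(path: String): Relation = Relation(format, List(path))

  // New overload for several paths. It is added next to load(path) rather
  // than replacing it, because changing an existing signature would break
  // binary compatibility.
  def load(paths: String*): Relation = {
    if (paths.length > 1 && !multiPathFormats.contains(format)) {
      // Fail early with a descriptive error when the source cannot take several paths.
      throw new IllegalArgumentException(
        s"Data source '$format' does not support multiple paths, " +
          s"but ${paths.length} were given: ${paths.mkString(", ")}")
    }
    Relation(format, paths.toList)
  }
}
```

With this shape, load("dir1", "dir2") yields two paths, while load("dir1,dir2") still yields the single path dir1,dir2.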

pull request 8416 tries to achieve this. can someone please review it? not 
having multiple paths is a serious issue for us.


> Spark SQL does not handle comma separates paths on Hadoop FileSystem
> --------------------------------------------------------------------
>
>                 Key: SPARK-10185
>                 URL: https://issues.apache.org/jira/browse/SPARK-10185
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: koert kuipers
>
> Spark SQL uses a Map[String, String] for data source settings. As a 
> consequence the only way to pass in multiple paths (something that hadoop 
> file input format supports) is to pass in a comma separated list. For 
> example:
> sqlContext.read.format("json").load("dir1,dir2")
> or
> sqlContext.read.format("json").option("path", "dir1,dir2").load
> However in this case ResolvedDataSource does not handle the comma delimited 
> paths correctly for a HadoopFsRelationProvider. It treats the multiple comma 
> delimited paths as a single path.
> For example, if I pass in "dir1,dir2" as the path, it will make dir1 qualified but 
> ignore dir2 (presumably because it simply treats it as part of dir1). If 
> globs are involved then it simply always returns an empty array of paths 
> (because the glob with comma in it doesn’t match anything).
> I think it's important to handle commas to pass in multiple paths, since the 
> framework does not provide an alternative. In some cases like parquet the 
> code simply bypasses ResolvedDataSource to support multiple paths but to me 
> this is a workaround that should be discouraged.
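
The ambiguity behind this issue can be sketched in plain Scala, without Spark. The object and method names below are hypothetical and only illustrate the two possible readings of a single path string stored in a Map[String, String]; they are not the ResolvedDataSource code.

```scala
// Hypothetical illustration only; this is not Spark's ResolvedDataSource logic.
object PathOption {
  // Reading 1: the whole value is one path (the behaviour reported in this issue).
  def asSinglePath(value: String): List[String] = List(value)

  // Reading 2: commas separate paths. This finds both directories, but it
  // can no longer represent a path whose name contains a comma.
  def splitOnComma(value: String): List[String] =
    value.split(",").map(_.trim).filter(_.nonEmpty).toList
}
```

For the option value dir1,dir2 the first reading gives one path and the second gives dir1 and dir2. For a file literally named a,b.json the second reading wrongly gives a and b.json. A string-only settings map cannot express both cases, which is why a separate multi-path API was requested.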



