[ https://issues.apache.org/jira/browse/SPARK-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949931#comment-14949931 ]
koert kuipers commented on SPARK-10185:
---------------------------------------

[~marmbrus] made it clear that the goal is to support commas as part of file paths, and therefore supporting multiple paths will have to be done in a different way. He said he was in favor of supporting multiple paths for all HadoopFSRelations under these conditions:
* We must keep source/binary compatibility.
* We should give good errors when the source does not support this feature.
* For consistency, I'd prefer if we can just add a load(path: String*) (but I'm not sure if this is possible given the above). paths(path: *) is okay, but I think I'd prefer if it was not the terminal operator.

Pull request 8416 tries to achieve this. Can someone please review it? Not having multiple paths is a serious issue for us.

> Spark SQL does not handle comma separates paths on Hadoop FileSystem
> --------------------------------------------------------------------
>
>                 Key: SPARK-10185
>                 URL: https://issues.apache.org/jira/browse/SPARK-10185
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: koert kuipers
>
> Spark SQL uses a Map[String, String] for data source settings. As a
> consequence the only way to pass in multiple paths (something that hadoop
> file input format supports) is to pass in a comma separated list. For
> example:
> sqlContext.read.format("json").load("dir1,dir2")
> or
> sqlContext.read.format("json").option("path", "dir1,dir2").load
> However in this case ResolvedDataSource does not handle the comma delimited
> paths correctly for a HadoopFsRelationProvider. It treats the multiple comma
> delimited paths as a single path.
> For example if I pass in for path "dir1,dir2" it will make dir1 qualified but
> ignore dir2 (presumably because it simply treats it as part of dir1). If
> globs are involved then it simply always returns an empty array of paths
> (because the glob with comma in it doesn’t match anything).
> I think it's important to handle commas to pass in multiple paths, since the
> framework does not provide an alternative. In some cases like parquet the
> code simply bypasses ResolvedDataSource to support multiple paths but to me
> this is a workaround that should be discouraged.
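
To illustrate why splitting one string on commas conflicts with allowing commas inside file paths, here is a minimal Scala sketch. It does not use Spark at all: MultiPathSketch, splitOnCommas and loadPaths are hypothetical names made up for this example, and loadPaths only mirrors the shape of the load(path: String*) signature mentioned above, not any actual implementation.

```scala
// Hypothetical sketch, not Spark code: compares two ways of passing several paths.
object MultiPathSketch {

  // Approach 1: one string, split on commas.
  // Any path that legitimately contains a comma gets broken apart.
  def splitOnCommas(path: String): List[String] =
    path.split(",").toList

  // Approach 2: varargs, one argument per path.
  // Commas inside a path are left untouched.
  def loadPaths(path: String*): List[String] =
    path.toList

  def main(args: Array[String]): Unit = {
    // Two plain directories: both approaches agree.
    println(splitOnCommas("dir1,dir2"))
    println(loadPaths("dir1", "dir2"))

    // A directory name that contains a comma: only varargs keeps it intact.
    println(splitOnCommas("data/a,b"))
    println(loadPaths("data/a,b", "dir2"))
  }
}
```

With a varargs signature the caller decides where one path ends and the next begins, so no separator character has to be reserved.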