koert kuipers created SPARK-10185:
-------------------------------------
Summary: Spark SQL does not handle comma separated paths on Hadoop
FileSystem
Key: SPARK-10185
URL: https://issues.apache.org/jira/browse/SPARK-10185
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.1
Reporter: koert kuipers
Spark SQL uses a Map[String, String] for data source settings. As a consequence,
the only way to pass in multiple paths (something that Hadoop file input formats
support) is to pass in a comma separated list. For example:
sqlContext.read.format("json").load("dir1,dir2")
or
sqlContext.read.format("json").option("path", "dir1,dir2").load()
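A self-contained way to reproduce this would be something like the sketch below
(names such as CommaPathRepro and the local master are my own placeholders;
dir1 and dir2 are assumed to be directories of JSON files):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CommaPathRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("comma-path-repro").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // The options map allows only one value per key, so both directories
    // have to be passed as a single comma separated string.
    val viaLoad = sqlContext.read.format("json").load("dir1,dir2")
    val viaOption = sqlContext.read.format("json").option("path", "dir1,dir2").load()

    println(viaLoad.count())
    println(viaOption.count())
  }
}
{code}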
However, in this case ResolvedDataSource does not handle the comma delimited
paths correctly for a HadoopFsRelationProvider: it treats the multiple comma
delimited paths as a single path.
For example, if I pass in "dir1,dir2" for the path, it will make dir1 qualified
but ignore dir2 (presumably because it simply treats it as part of dir1). If
globs are involved, it always returns an empty array of paths (because the glob
with a comma in it doesn't match anything).
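For what it's worth, one possible direction would be to split the path option on
commas and qualify each entry against its own FileSystem, similar to what Hadoop's
FileInputFormat does. A rough sketch only, with illustrative names; this is not
the actual ResolvedDataSource code:
{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SQLContext

// Split the single "path" option on commas and qualify each entry against
// its own FileSystem, instead of qualifying the whole comma separated
// string as if it were one path. Globs would still need to be expanded
// per entry after this step.
def qualifyCommaSeparatedPaths(sqlContext: SQLContext, pathOption: String): Array[String] = {
  val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
  pathOption.split(",").map(_.trim).filter(_.nonEmpty).map { p =>
    val hdfsPath = new Path(p)
    val fs = hdfsPath.getFileSystem(hadoopConf)
    fs.makeQualified(hdfsPath).toString
  }
}
{code}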
I think it's important to handle commas as a way of passing in multiple paths,
since the framework does not provide an alternative. In some cases, like parquet,
the code simply bypasses ResolvedDataSource to support multiple paths, but to me
this is a workaround that should be discouraged.
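For comparison, the parquet shortcut mentioned above looks roughly like this from
the user side (assuming the varargs DataFrameReader.parquet overload in 1.4.x):
{code:scala}
// Parquet accepts multiple paths directly and so never goes through the
// single comma separated "path" option.
val df = sqlContext.read.parquet("dir1", "dir2")
{code}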