Jussi-Pekka Partanen created SPARK-22691:
--------------------------------------------
             Summary: Custom HttpFileSystem, issue with question-marks in path
                 Key: SPARK-22691
                 URL: https://issues.apache.org/jira/browse/SPARK-22691
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.2.0
            Reporter: Jussi-Pekka Partanen
            Priority: Minor

I'm working on a use case that requires loading several files from HTTP locations, in different file formats (CSV, JSON, etc.) and with different compression methods.

I'm using a custom HTTP FileSystem implementation. I'm running into an issue where a question mark character (?) in the HTTP URL causes Spark to fail with the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: http://someserverhere.com/getresults?results=300&format=CSV;
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:355)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
	at scala.collection.immutable.List.flatMap(List.scala:344)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
	at com.whereos.engine.Sessions.main(Sessions.java:320)

If the HTTP URL does not contain a '?' character, for example a URL in the format http://someserverhere.com/getresults/results=3500/format=CSV, it works without any problems.
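
One thing that may be relevant here (an assumption, not something verified in this report): '?' both starts the query component of a URL and is commonly a single-character wildcard in glob patterns, so a path-handling layer could either drop everything after it or treat the path as a pattern. The minimal sketch below uses only java.net.URI to show how the two URLs split, plus a hypothetical helper, containsGlobChar, that checks for an assumed set of glob metacharacters. The class name, the helper, and the character set are illustrative and are not taken from Spark or Hadoop.

```java
import java.net.URI;

// Illustrative only: class and method names are hypothetical.
class QuestionMarkPathDemo {

    // Assumed set of glob metacharacters; the real set used by any
    // particular path-resolution layer may differ.
    private static final String GLOB_CHARS = "{}[]*?\\";

    // Returns true if the string contains any assumed glob metacharacter.
    static boolean containsGlobChar(String path) {
        for (int i = 0; i < path.length(); i++) {
            if (GLOB_CHARS.indexOf(path.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String failing = "http://someserverhere.com/getresults?results=300&format=CSV";
        String working = "http://someserverhere.com/getresults/results=3500/format=CSV";

        for (String s : new String[] {failing, working}) {
            URI uri = URI.create(s);
            System.out.println("url   = " + s);
            // For the failing URL the path component ends before the '?'.
            System.out.println("path  = " + uri.getPath());
            // Everything after the '?' is the query; null when there is none.
            System.out.println("query = " + uri.getQuery());
            System.out.println("glob? = " + containsGlobChar(s));
        }
    }
}
```

If the path layer really does behave this way, the failing URL would be the only one of the two that gets special treatment, since it is the only one containing a character from the assumed set.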
The data is read with a very simple statement like the one below:

    Dataset<Row> df = context.read()
            .option("inferSchema", "true")
            .option("header", "true")
            .option("quote", "\"")
            .csv(url);

The custom file system is registered by setting "fs.http.impl" to com.test.MyHttpFileSystem.class.getName().

On MyHttpFileSystem, the calls to fs.exists() and fs.getFileStatus() differ between the two cases above (working and failing).

The working case first checks whether URL/_spark_metadata exists (obviously it does not), and then correctly calls fs.exists('http://someserverhere.com/getresults/results=3500/format=CSV') and fs.getFileStatus('http://someserverhere.com/getresults/results=3500/format=CSV') with the full URL.

The failing case also checks for _spark_metadata first, but the following calls to fs.exists() and fs.getFileStatus() no longer include the full path: the URL path element containing the '?' character is omitted. That is, the system calls fs.exists('http://someserverhere.com/') instead of fs.exists('http://someserverhere.com/getresults?results=300&format=CSV').