Jussi-Pekka Partanen created SPARK-22691:
--------------------------------------------

             Summary: Custom HttpFileSystem, issue with question-marks in path
                 Key: SPARK-22691
                 URL: https://issues.apache.org/jira/browse/SPARK-22691
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.2.0
            Reporter: Jussi-Pekka Partanen
            Priority: Minor


I'm working on a use case that requires loading several files from HTTP 
locations, in different file formats (CSV, JSON, etc.) and with different 
compression methods. I'm using a custom HTTP FileSystem implementation. I'm 
running into an issue where a question mark character (?) in the HTTP URL 
causes Spark to fail with the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: http://someserverhere.com/getresults?results=300&format=CSV;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:355)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
        at scala.collection.immutable.List.flatMap(List.scala:344)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
        at com.whereos.engine.Sessions.main(Sessions.java:320)

If the HTTP URL doesn't contain a '?' character, for example 
http://someserverhere.com/getresults/results=3500/format=CSV, it works 
without any problems. 

The data is read with a very simple statement like this one below:

                Dataset<Row> df = context.read()
                                .option("inferSchema", "true")
                                .option("header", "true")
                                .option("quote", "\"")
                                .csv(url);

The custom file system is registered by setting "fs.http.impl" to 
com.test.MyHttpFileSystem.class.getName()

On MyHttpFileSystem, the calls to fs.exists() and fs.getFileStatus() seem to 
differ between the two cases above (working and failing). The working case 
first checks whether URL/_spark_metadata exists (obviously not), and then 
correctly calls 
fs.exists('http://someserverhere.com/getresults/results=3500/format=CSV') and 
fs.getFileStatus('http://someserverhere.com/getresults/results=3500/format=CSV') 
with the full URL. 

The failing case also checks for _spark_metadata first, but the following 
calls to fs.exists() and fs.getFileStatus() no longer include the full path: 
the URL path element containing the '?' character is omitted, i.e. the system 
calls fs.exists('http://someserverhere.com/') instead of 
fs.exists('http://someserverhere.com/getresults?results=300&format=CSV').
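To illustrate why '?' is a special character in these URLs, here is a minimal 
sketch using only java.net.URI from the JDK. It is not taken from the Spark or 
Hadoop code, and the class name UriSplitDemo is made up for this example. It 
only shows that a generic URI parser ends the path component at '?' for the 
failing URL, while the working URL is all path. Whether this split, or some 
other special handling of '?', is what actually drops the path inside Spark 
is an assumption I have not verified.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of Spark, Hadoop, or MyHttpFileSystem.
public class UriSplitDemo {

    // Path component as java.net.URI parses it (everything before '?').
    public static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    // Query component (everything after '?'), or null if the URL has none.
    public static String queryOf(String url) {
        return URI.create(url).getQuery();
    }

    public static void main(String[] args) {
        String failing = "http://someserverhere.com/getresults?results=300&format=CSV";
        String working = "http://someserverhere.com/getresults/results=3500/format=CSV";

        System.out.println(pathOf(failing));   // /getresults
        System.out.println(queryOf(failing));  // results=300&format=CSV
        System.out.println(pathOf(working));   // /getresults/results=3500/format=CSV
        System.out.println(queryOf(working));  // null
    }
}
```

In the failing URL the parameters live in the query component rather than in 
the path, so any code that works only with the path component would never see 
them.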



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
