peay created SPARK-21659:
----------------------------

             Summary: FileStreamSink checks for _spark_metadata even if path has globs
                 Key: SPARK-21659
                 URL: https://issues.apache.org/jira/browse/SPARK-21659
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, SQL
    Affects Versions: 2.2.0
            Reporter: peay
            Priority: Minor


I am using the GCS connector for Hadoop, and reading a DataFrame using {{context.read.format("parquet").load("...")}}.

When my URI has glob patterns of the form
{code}
gs://uri/{a,b,c}
{code}
or as below, Spark incorrectly assumes that it is a single file path and produces this rather verbose exception:

{code}
java.net.URISyntaxException: Illegal character in path at index xx: gs://bucket-name/path/to/date=2017-0{1-29,1-30,1-31,2-01,2-02,2-03,2-04}*/_spark_metadata
        at java.net.URI$Parser.fail(URI.java:2848)
        at java.net.URI$Parser.checkChars(URI.java:3021)
        at java.net.URI$Parser.parseHierarchical(URI.java:3105)
        at java.net.URI$Parser.parse(URI.java:3053)
        at java.net.URI.<init>(URI.java:588)
        at com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:93)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:171)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1421)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
        at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
{code}
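
For reference, a minimal sketch of a reproduction (the bucket name, the date pattern, and the {{spark}} session variable are placeholders, not taken from the actual job):

{code}
// Hypothetical reproduction sketch -- bucket and date pattern are placeholders.
// Any load() path containing an unescaped {...} glob takes the same route
// through DataSource.resolveRelation into FileStreamSink.hasMetadata.
val df = spark.read
  .format("parquet")
  .load("gs://some-bucket/path/to/date=2017-0{1-29,1-30,1-31}*")
{code}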

I am not sure whether the GCS connector deviates from HCFS-standard behavior here, but this makes logs very hard to read for jobs that load many paths like this.
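
For what it's worth, the failure can be shown with plain {{java.net.URI}} and no GCS involved, which suggests the brace rejection comes from {{LegacyPathCodec}} constructing a {{java.net.URI}} directly (the path below is illustrative):

{code}
// Illustrative only: java.net.URI rejects unescaped braces outright, which
// matches the LegacyPathCodec.getPath frame in the stack trace above.
new java.net.URI("gs://bucket/path/{a,b}/_spark_metadata")
// throws java.net.URISyntaxException: Illegal character in path at index 17
{code}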

https://github.com/apache/spark/blob/3ac60930865209bf804ec6506d9d8b0ddd613157/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L39 already has an explicit {{case Seq(singlePath) =>}}, but the name is misleading because {{singlePath}} can itself contain wildcards. In addition to matching on arity, it could check for non-escaped glob characters, like

{code}
{, }, ?, *
{code}

and fall through to the multiple-paths case when those are present, where the {{_spark_metadata}} lookup is skipped (see the sketch below).
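
As a sketch only (the helper name and structure are illustrative, not the actual Spark implementation), the check could look something like:

{code}
// Illustrative sketch -- not the actual Spark code. Detects unescaped glob
// metacharacters so hasMetadata can fall through to the multiple-paths case
// instead of probing <path>/_spark_metadata on a glob.
def containsUnescapedGlob(path: String): Boolean = {
  val globChars = Set('{', '}', '?', '*')
  var escaped = false
  path.exists { c =>
    if (escaped) { escaped = false; false }
    else if (c == '\\') { escaped = true; false }
    else globChars.contains(c)
  }
}

// Inside FileStreamSink.hasMetadata, sketched:
// path match {
//   case Seq(singlePath) if !containsUnescapedGlob(singlePath) =>
//     ... // existing _spark_metadata existence check
//   case _ => false
// }
{code}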


