[ https://issues.apache.org/jira/browse/SPARK-21659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-21659.
----------------------------------
    Resolution: Incomplete

> FileStreamSink checks for _spark_metadata even if path has globs
> ----------------------------------------------------------------
>
>                 Key: SPARK-21659
>                 URL: https://issues.apache.org/jira/browse/SPARK-21659
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 2.2.0
>            Reporter: peay
>            Priority: Minor
>              Labels: bulk-closed
>
> I am using the GCS connector for Hadoop and reading a DataFrame with 
> {{context.read.format("parquet").load("...")}}.
> When the URI contains glob patterns of the form
> {code}
> gs://uri/{a,b,c}
> {code}
> or as in the trace below, Spark incorrectly assumes it is a single file 
> path and produces this rather verbose exception:
> {code}
> java.net.URISyntaxException: Illegal character in path at index xx: 
> gs://bucket-name/path/to/date=2017-0{1-29,1-30,1-31,2-01,2-02,2-03,2-04}*/_spark_metadata
>       at java.net.URI$Parser.fail(URI.java:2848)
>       at java.net.URI$Parser.checkChars(URI.java:3021)
>       at java.net.URI$Parser.parseHierarchical(URI.java:3105)
>       at java.net.URI$Parser.parse(URI.java:3053)
>       at java.net.URI.<init>(URI.java:588)
>       at 
> com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:93)
>       at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:171)
>       at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1421)
>       at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
>       at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>       at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
>       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>       at py4j.Gateway.invoke(Gateway.java:280)
>       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>       at py4j.commands.CallCommand.execute(CallCommand.java:79)
>       at py4j.GatewayConnection.run(GatewayConnection.java:214)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> I am not sure whether the GCS connector deviates from the HCFS specification 
> here, but this makes logs very hard to read for jobs that load many files 
> this way.
> https://github.com/apache/spark/blob/3ac60930865209bf804ec6506d9d8b0ddd613157/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L39
>  already has an explicit {{case Seq(singlePath) =>}}, but it is misleading 
> because {{singlePath}} can contain wildcards. In addition, the code could 
> check for unescaped glob characters, such as
> {code}
> {, }, ?, *
> {code}
> and fall through to the multiple-paths case, where the metadata lookup is 
> skipped, whenever those are present.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
