[
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182650#comment-16182650
]
Marco Gaido commented on SPARK-22146:
-------------------------------------
If you look carefully at the file Spark is looking for, you'll see that it
doesn't exist: its name is the result of improper encoding.
So, yes, the right file exists, but Spark is looking for the wrong one.
We tried both HDFS and the local filesystem; the error is the same, and it is
caused by the encoding of the path during the inferSchema process. I am
preparing a PR to fix it and will post it as soon as it is ready.
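To illustrate the mechanism (this is a sketch of the encoding behavior, not the actual Spark code path): the on-disk name {{folder %3Aa}} already contains a literal '%', so percent-encoding it a second time turns the space into {{%20}} and the '%' into {{%25}}, producing exactly the non-existent {{folder%20%253Aa}} seen in the stack trace below.

```python
from urllib.parse import quote, unquote

# On-disk directory name from the reproduction steps: it already
# contains a space and a literal '%'.
on_disk = "folder %3Aa"

# Percent-encoding the name once more (as a buggy path round-trip
# would) mangles it: ' ' -> '%20' and '%' -> '%25'.
encoded = quote(on_disk)
print(encoded)           # folder%20%253Aa -- the path in the FileNotFoundException

# Decoding once recovers the real name, so the fix is to avoid the
# extra encode step rather than to decode twice.
print(unquote(encoded))  # folder %3Aa
```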
> FileNotFoundException while reading ORC files containing '%'
> ------------------------------------------------------------
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Marco Gaido
>
> Reading ORC files containing "strange" characters like '%' fails with a
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
> at
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
> at
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
> at
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
> at
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
> at
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
> at
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
> at scala.Option.orElse(Option.scala:289)
> at
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
> ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)