[jira] [Resolved] (SPARK-24320) Cannot read file names with spaces

Hyukjin Kwon (JIRA) Sun, 20 May 2018 20:10:46 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-24320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon resolved SPARK-24320.
----------------------------------
    Resolution: Duplicate

>  Cannot read file names with spaces
> -----------------------------------
>
>                 Key: SPARK-24320
>                 URL: https://issues.apache.org/jira/browse/SPARK-24320
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.0
>            Reporter: Zachary Radtka
>            Priority: Major
>
> I am trying to read a file on HDFS that has a space in its name, e.g. 
> "file 1.csv", and I get a `java.io.FileNotFoundException: File does not exist` 
> error.
> The versions of software I am using are:
>  * Spark: 2.2.0.2.6.3.0-235
>  * Scala: version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
> As a reproducible example, I have the same file in HDFS under two names, 
> "file.csv" and "file 1.csv":
> {code:none}
> $ hdfs dfs -ls /tmp
> -rw-r--r-- 3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file 1.csv
> -rw-r--r-- 3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file.csv
> {code}
>  
> The following script was used to successfully read from the file that does 
> not have a space in the name:
> {code:java}
> scala> val if1 = "/tmp/file.csv"
> if1: String = /tmp/file.csv
> scala> val origTable = spark.read.format("csv").option("header", 
> "true").option("delimiter", ",").option("multiLine", true).option("escape", 
> "\"").load(if1);
> origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED]
> scala> origTable.take(2)
> res3: Array[org.apache.spark.sql.Row] = Array([DATA REDACTED])
> {code}
>  
> The same script was used to try and read from the file that has a space in 
> the name:
> {code:java}
>  scala> val if2 = "/tmp/file 1.csv"
>  if2: String = /tmp/file 1.csv
> scala> val origTable = spark.read.format("csv").option("header", 
> "true").option("delimiter", ",").option("multiLine", true).option("escape", 
> "\"").load(if2);
>  origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED]
> scala> origTable.take(2)
>  18/05/18 18:58:40 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
>  java.io.FileNotFoundException: File does not exist: /tmp/file%201.csv
>  at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>  at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> It is possible the underlying files have been updated. You can explicitly 
> invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
> SQL or by recreating the Dataset/DataFrame involved.
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  18/05/18 18:58:40 WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 8, 
> localhost, executor driver): java.io.FileNotFoundException: File does not 
> exist: /tmp/file%201.csv
>  at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>  at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> {code}
> The underlying error is `java.io.FileNotFoundException: File does not exist: 
> /tmp/file%201.csv`. It seems that the CSV reader is URL-encoding the path, so 
> the space becomes `%20` and the file is not found.
> I also tested out specifying the file location in HDFS, `val if2 = 
> "hdfs:///tmp/file 1.csv"`, and I received the same error.
> I also checked that the problem does not occur with Spark's textFile 
> reader. It had no problem reading the file:
> {code}
>  scala> sc.textFile(if2).take(2)
>  res5: Array[String] = Array(DATA REDACTED)
> {code}
> One interesting thing to note is that `printSchema` does work, but any 
> operation that reads the file fails with a `FileNotFoundException`.
> The temporary workaround for this problem is removing spaces from filenames.
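
The reporter's hypothesis, that the path is percent-encoded somewhere before it reaches HDFS, can be illustrated with plain `java.net.URI`. This is only a sketch of the suspected mechanism, not Spark's actual code path; the object name `PathEncodingSketch` and the helper `toUri` are hypothetical and exist only for this illustration.

```scala
import java.net.URI

object PathEncodingSketch {
  // Hypothetical helper: build a URI from a raw path. The multi-argument
  // URI constructor percent-encodes characters that are illegal in a URI
  // path, such as a space.
  def toUri(rawPath: String): URI = new URI(null, null, rawPath, null)

  def main(args: Array[String]): Unit = {
    val uri = toUri("/tmp/file 1.csv")

    // Encoded form: using this string as a literal file name would look
    // for a file that does not exist.
    println(uri.toString) // /tmp/file%201.csv

    // Decoded form: this matches the name actually stored in HDFS.
    println(uri.getPath)  // /tmp/file 1.csv
  }
}
```

If some layer passes the encoded string along where the decoded path is expected, the lookup would target `/tmp/file%201.csv`, which matches the path shown in the exception above.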



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
