[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

Ewen Cheslack-Postava (JIRA) Mon, 18 May 2015 17:28:03 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549562#comment-14549562
 ]


Ewen Cheslack-Postava commented on SPARK-6527:
----------------------------------------------

Here's a stack trace, which looks like it's incorrectly trying to use the local 
filesystem to open the file:

{quote}
java.io.FileNotFoundException: File /path/to/file does not exist.
        at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
        at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
        at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
        at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
        at 
org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:44)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
        at $iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC.<init>(<console>:39)
        at $iwC.<init>(<console>:41)
        at <init>(<console>:43)
        at .<init>(<console>:47)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
        at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
        at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
        at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
        at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
        at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
        at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
        at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
        at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{quote}

This only gets this far if your initial binaryFiles parameter ends on a / 
delimiter. If you try to match an arbitrary prefix it fails trying to list the 
files:

{quote}
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: s3n://path/to/files/partial-prefix
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
        at 
org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:47)
        at 
org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:43)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)
{quote}

It also fails that way if you provide a complete path to a file, so you can't 
seem to load a binary file from S3 at all currently.

> sc.binaryFiles can not access files on s3
> -----------------------------------------
>
>                 Key: SPARK-6527
>                 URL: https://issues.apache.org/jira/browse/SPARK-6527
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, Input/Output
>    Affects Versions: 1.2.0, 1.3.0
>         Environment: I am running Spark on EC2
>            Reporter: Zhao Zhang
>            Priority: Minor
>
> The sc.binaryFIles() can not access the files stored on s3. It can correctly 
> list the number of files, but report "file does not exist" when processing 
> them. I also tried sc.textFile() which works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

Reply via email to