[ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517886#comment-14517886 ]

Tristan Nixon commented on SPARK-4414:
--------------------------------------

Thanks, [~petedmarsh], I was having this same issue. It worked fine on my OS X 
laptop but not on an EC2 Linux instance I set up with the spark-ec2 script. My 
local version was built against Hadoop 2.4, but the default for clusters 
configured by the script is Hadoop 1. The problem appears to come down to the 
S3 filesystem drivers shipped with the different Hadoop versions.

I destroyed and then re-launched my EC2 cluster using the 
--hadoop-major-version=2 option, and the resulting cluster works!
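For anyone hitting the same thing, the relaunch looked roughly like this (the 
cluster name, key pair, and identity file below are placeholders; credentials 
are assumed to be set via the usual AWS environment variables):

  ./spark-ec2 destroy my-cluster
  ./spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
    --hadoop-major-version=2 launch my-cluster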

Perhaps support for Hadoop 1 should be deprecated? At least, it probably should 
no longer be the default version used in the spark-ec2 scripts.

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> --------------------------------------------------------
>
>                 Key: SPARK-4414
>                 URL: https://issues.apache.org/jira/browse/SPARK-4414
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Pedro Rodriguez
>            Priority: Critical
>
> SparkContext.wholeTextFiles does not read files which SparkContext.textFile 
> can read. Below are general steps to reproduce; my specific case, in a git 
> repo, is linked after that.
> Steps to reproduce:
> 1. Create an Amazon S3 bucket with multiple files and make it public.
> 2. Attempt to read from the bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even though the file exists:
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt
>       at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>       at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error; the application runs fine (see the sketch below).
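> A minimal, self-contained Scala sketch of the reproduction, to be submitted 
> with spark-submit (the bucket and file names are placeholders; AWS 
> credentials are assumed to be configured in the Hadoop configuration or 
> environment):
> import org.apache.spark.{SparkConf, SparkContext}
> object WholeTextFilesRepro {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf().setAppName("wholeTextFiles-s3-repro")
>     val sc = new SparkContext(conf)
>     // Throws the FileNotFoundException above when run against the
>     // Hadoop 1 S3 driver:
>     val whole = sc.wholeTextFiles("s3n://mybucket/myfile.txt")
>     println(whole.count())
>     // The same path reads fine through textFile:
>     val lines = sc.textFile("s3n://mybucket/myfile.txt")
>     println(lines.count())
>     sc.stop()
>   }
> }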
> There is also a question about this on StackOverflow:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> Here is a link to the relevant lines in the repo. The uncommented call 
> doesn't work; the commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> Falling back to textFile with a comma-separated list of files would be an 
> easy workaround, but wholeTextFiles should work correctly against S3 buckets 
> as well.


