Pedro Rodriguez created SPARK-4414:
--------------------------------------

             Summary: SparkContext.wholeTextFiles doesn't work with S3 buckets
                 Key: SPARK-4414
                 URL: https://issues.apache.org/jira/browse/SPARK-4414
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.1.0, 1.2.0
            Reporter: Pedro Rodriguez
SparkContext.wholeTextFiles does not read files which SparkContext.textFile can read. Below are general steps to reproduce; my specific case, which follows these steps, is in the git repo linked at the end.

Steps to reproduce:

1. Create an Amazon S3 bucket, make it public, and upload multiple files.
2. Attempt to read the bucket with sc.wholeTextFiles("s3n://mybucket/myfile.txt")
3. Spark returns the following error, even though the file exists:

Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)

4. Change the call to sc.textFile("s3n://mybucket/myfile.txt") and there is no error message; the application runs fine.

There is also a question about this on StackOverflow:
http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist

This is a link to the repo/lines of code. The uncommented call doesn't work; the commented call works as expected:
https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19

It would be easy enough to work around this by passing multiple paths to textFile, but wholeTextFiles should work correctly on S3 bucket files as well.
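For reference, a minimal, self-contained sketch of the reproduction described above. The bucket and file names are placeholders, and it assumes the s3n filesystem and AWS credentials are already configured for the cluster:

import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesS3Repro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wholeTextFiles-s3n-repro")
    val sc = new SparkContext(conf)

    // Works: textFile reads the S3 object and returns an RDD[String] of its lines.
    val lines = sc.textFile("s3n://mybucket/myfile.txt")
    println(s"textFile line count: ${lines.count()}")

    // Fails with java.io.FileNotFoundException: File does not exist: /myfile.txt,
    // even though textFile can read the same object.
    val whole = sc.wholeTextFiles("s3n://mybucket/myfile.txt")
    println(s"wholeTextFiles pair count: ${whole.count()}")

    sc.stop()
  }
}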