Anyone dealing with a large number of files in Spark? We're trying s3a with Spark 2.0.1 because we're seeing intermittent S3 errors where jobs fail and saveAsTextFile fails. Using pyspark.
Is there any issue with working in an S3 folder that has too many files? How about having versioning enabled? Are these things going to be a problem?

We're pre-building the S3 file list, storing it in a file, and passing it to textFile as a long comma-separated list of files, so we are not running a list-files operation ourselves. But saveAsTextFile still fails with a ListBucket-related error, even though we're not using the wildcard '*':

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Failed to parse XML document with handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler

Running Spark 2.0.1 with the s3a protocol. Thanks.
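For reference, this is roughly the pattern we're using (bucket names and paths here are made-up placeholders, not our actual setup): we pre-list the keys ourselves, join them into the comma-separated form textFile accepts, and only the final save step touches bucket listing.

```python
def build_input_paths(listing_lines):
    """Join pre-listed S3 object paths into the single
    comma-separated string that sc.textFile accepts,
    skipping blank lines from the listing file."""
    return ",".join(line.strip() for line in listing_lines if line.strip())

# Usage inside the job (requires a running SparkContext `sc`;
# the file and bucket names below are hypothetical):
#
# with open("/tmp/s3_file_list.txt") as f:
#     paths = build_input_paths(f)
# rdd = sc.textFile(paths)              # no bucket listing needed here
# rdd.saveAsTextFile("s3a://our-bucket/output/")  # this is the call that fails
```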