Anyone dealing with a lot of files with Spark? We're trying s3a with 2.0.1
because we're seeing intermittent errors in S3 where jobs fail and
saveAsTextFile fails. Using PySpark.

Is there any issue with working in an S3 folder that has too many files?
How about having versioning enabled? Are these things going to be a problem?

We're pre-building the S3 file list, storing it in a file, and passing it
to textFile as a long comma-separated list of paths, so we are not running
a list operation ourselves.
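
For context, the read side looks roughly like this (the list file name and
paths are placeholders, not our real ones):

from pyspark import SparkContext

sc = SparkContext(appName="s3a-read")

# Pre-built list of S3 object paths, one per line (placeholder file name)
with open("file_list.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

# textFile accepts a comma-separated list of paths,
# so Spark never has to list the bucket on the read side
rdd = sc.textFile(",".join(paths))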

But we still get errors from saveAsTextFile related to ListBucket, even
though we're not using a wildcard ('*'):

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
Failed to parse XML document with handler class
org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler
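
The write itself is nothing fancy; roughly like this (bucket name, output
prefix, and credentials below are placeholders):

# Making sure s3a is actually the filesystem in play (placeholder credentials)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<ACCESS_KEY>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<SECRET_KEY>")

# Plain saveAsTextFile to an s3a:// URI, no wildcards anywhere
rdd.saveAsTextFile("s3a://our-bucket/output/")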


Running Spark 2.0.1 with the s3a protocol.

thanks
