Hi,
Arpan Ghosh wrote:
> Hi,
>
> How can I implement a custom MultipleOutputFormat and specify it as
> the output of my Spark job so that I can ensure that there is a unique
> output file per key (instead of a unique output file per reducer)?
>
I use something like this:
class KeyBasedOutput[T
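(The class above is cut off in the archive.) A sketch of how such a class typically looks, assuming it extends Hadoop's `MultipleTextOutputFormat` — this is an illustration of the pattern, not necessarily the author's exact code:

```scala
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Sketch: name each output file after its key, and drop the key from
// the file contents, so that each key gets its own output file.
class KeyBasedOutput[T >: Null, V <: AnyRef]
    extends MultipleTextOutputFormat[T, V] {

  // use the key itself as the file name instead of part-NNNNN
  override def generateFileNameForKeyValue(key: T, value: V, name: String): String =
    key.toString

  // return null so that only the value is written to the file
  override protected def generateActualKey(key: T, value: V): T =
    null
}
```

You would then pass this class to `saveAsHadoopFile` on a pair RDD; note that you typically want to partition by key first (e.g. with a `HashPartitioner`), so that all records for a given key land in a single task and do not fight over the same file name.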
Hi,
How large is the dataset you're saving into S3?
Saving to S3 is actually done in two steps:
1) writing temporary files
2) committing them to the proper directory
Step 2) can be slow because S3 does not have a quick atomic "move"
operation; you have to copy (server side, but it still takes time) and the
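One way to cheapen step 2) — a sketch, assuming a Hadoop version recent enough to support the v2 `FileOutputCommitter` algorithm — is to have tasks commit their output directly to the final directory:

```scala
// sketch: use the v2 commit algorithm, which skips the extra
// job-level rename/copy pass (property exists in newer Hadoop releases)
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
```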
Hi,
keep in mind that you're going to have a bad time if your secret key
contains a "/"
This is due to an old and stupid Hadoop bug:
https://issues.apache.org/jira/browse/HADOOP-3733
The best way is to regenerate the key so it does not include a "/"
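To illustrate why a "/" in the secret breaks things — this uses `java.net.URI` rather than Hadoop's actual parser, so take it as an analogy for the HADOOP-3733 failure mode, not a reproduction of it:

```scala
import java.net.URI

// A secret key embedded in an s3n:// URL parses fine without a slash...
val ok = new URI("s3n://ACCESSKEY:SECRETKEY@bucket/path")
assert(ok.getUserInfo == "ACCESSKEY:SECRETKEY")

// ...but a "/" in the secret ends the authority component early, and
// the credentials are silently lost (getUserInfo comes back null).
val broken = new URI("s3n://ACCESSKEY:SECRET/KEY@bucket/path")
assert(broken.getUserInfo == null)
```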
/Raf
Akhil Das wrote:
> Try the following:
>
> 1. Set
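(The reply above is cut off in the archive.) A common way to supply S3 credentials without embedding them in the URL at all — which also sidesteps the "/" problem — is to set the standard s3n Hadoop properties; a sketch, with hypothetical placeholder values:

```scala
// sketch: set S3 credentials on the job's Hadoop configuration
// instead of embedding them in the s3n:// URL (placeholders are hypothetical)
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
```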
Hi,
This will work nicely unless you're using spot instances; in that case
"start" does not work, as the slaves are lost on shutdown.
I feel like the spark-ec2 script needs a major refactor to cope with new
features and with more users using it in dynamic environments.
Are there any current plans to migrate it to