Re: using MultipleOutputFormat to ensure one output file per key

2014-11-25 Thread Rafal Kwasny
Hi,
Arpan Ghosh wrote:
> Hi,
>
> How can I implement a custom MultipleOutputFormat and specify it as
> the output of my Spark job so that I can ensure that there is a unique
> output file per key (instead of a unique output file per reducer)?

I use something like this: class KeyBasedOutput[T
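
The archived message is cut off at the class declaration, so the body below is only a sketch of the usual pattern it refers to: extend Hadoop's old-API MultipleTextOutputFormat, name each output file after the key, and drop the key from the written record. The type parameters and the save call are assumptions, not the author's exact code.

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {
      // One file per key: use the key itself as the output file name
      override def generateFileNameForKeyValue(key: T, value: V, name: String): String =
        key.toString
      // Return null so only the value is written into the file
      override def generateActualKey(key: T, value: V): T = null
    }

    // Hypothetical usage on a pair RDD of (String, String); partitioning by key first
    // keeps all records for a key in one task, so each key yields a single file:
    // rdd.partitionBy(new org.apache.spark.HashPartitioner(numPartitions))
    //    .saveAsHadoopFile("/output/path", classOf[String], classOf[String],
    //                      classOf[KeyBasedOutput[String, String]])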

Re: Spark output to s3 extremely slow

2014-10-15 Thread Rafal Kwasny
Hi,
How large is the dataset you're saving into S3? Saving to S3 is actually done in two steps:
1) writing temporary files
2) committing them to the proper directory
Step 2) can be slow because S3 does not have a quick atomic "move" operation; you have to copy (server side, but it still takes time) and then delete the originals.
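
A minimal sketch of the two-step write described above, assuming the default FileOutputCommitter behaviour; the bucket and path are placeholders:

    // Step 1: each task writes its part file under <output>/_temporary/...
    // Step 2: on job commit, those files are "renamed" into <output>/,
    //         which on S3 means a server-side copy of every byte rather
    //         than an atomic move, so large outputs commit slowly.
    rdd.saveAsTextFile("s3n://my-bucket/output")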

Re: S3 Bucket Access

2014-10-14 Thread Rafal Kwasny
Hi,
Keep in mind that you're going to have a bad time if your secret key contains a "/". This is due to an old and stupid Hadoop bug: https://issues.apache.org/jira/browse/HADOOP-3733
The best way is to regenerate the key so it does not include a "/".
/Raf

Akhil Das wrote:
> Try the following:
>
> 1. Set
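
Akhil's quoted reply is truncated above; a hedged sketch of the kind of setup it points at is to pass the credentials through the Hadoop configuration rather than embedding them in the s3n:// URI, which keeps a "/" in the secret out of URL parsing (the thread's own recommendation is still simply to regenerate the key). The variable names are illustrative:

    // Keep credentials out of the URI so HADOOP-3733-style parsing never sees them
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKeyId)
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretAccessKey)
    val data = sc.textFile("s3n://my-bucket/input")  // no credentials embedded here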

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-06 Thread Rafal Kwasny
Hi,
This will work nicely unless you're using spot instances; in that case "start" does not work, as the slaves are lost on shutdown. I feel like the spark-ec2 script needs a major refactor to cope with new features and with more users running it in dynamic environments. Are there any current plans to migrate it to