On 14/09/16 13:55, Ellison Anne Williams wrote:
> In the meantime/very near term, we could provide a step-by-step
> AWS/GCP/Azure instructions for bringing up a small cluster, running the
> distributed tests, and debugging. Admittedly, most of this is handled in
> the AWS/GCP/Azure documentation, but, in my experience, the documentation
> is confusing and very time consuming to get through the first time.

So do you advise running bare VMs and installing Hadoop, or running the
AWS Elastic MapReduce (EMR) service?

Here's where I've been going so far, but I don't want to start a wiki
entry with instructions if this is the wrong approach altogether...

- Sign up for an AWS account.
    https://aws.amazon.com

- Obtain access keys
    https://console.aws.amazon.com/iam

- Install the aws command-line tool
    https://aws.amazon.com/cli

- Configure the aws tool
    Choose a default region from the EMR region list
    http://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region

    $ aws configure
    AWS Access Key ID [None]: AKIAI44QH8DHBEXAMPLE
    AWS Secret Access Key [None]: je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
    Default region name [None]: eu-west-1
    Default output format [None]: text

- Create an EC2 key pair, and download it, e.g. "SparkClusterKeys.pem".

- Create a Spark cluster

    $ aws emr create-cluster \
        --name "Spark Cluster" \
        --release-label emr-5.0.0 \
        --applications Name=Spark \
        --ec2-attributes KeyName=SparkClusterKeys \
        --instance-type m3.xlarge \
        --instance-count 3 \
        --use-default-roles

    which returns a cluster ID, e.g. j-3KVTXXXXXX7UG

- Upload the JAR file

    $ aws emr put --cluster-id j-3KVTXXXXXX7UG \
        --key-pair-file SparkClusterKeys.pem \
        --src apache-pirk-0.0.1-SNAPSHOT-exe.jar

- Run the distributed tests

    $ aws emr ssh --cluster-id j-3KVTXXXXXX7UG \
        --key-pair-file SparkClusterKeys.pem \
        --command "hadoop jar <pirkJar> org.apache.pirk.test.distributed.DistributedTestDriver -j <full path to pirkJar>"

- Terminate the cluster

    $ aws emr terminate-clusters --cluster-ids j-3KVTXXXXXX7UG

Look at the charges per hour and think, there may be a better way...

Regards,
Tim
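
P.S. If this approach is the right one, the steps above could be strung together roughly as follows. This is only a sketch and has not been run against a live cluster. The function and variable names (AWS_CMD, KEY_NAME, KEY_FILE, PIRK_JAR, REMOTE_JAR) are my own hypothetical choices. The remote path under /home/hadoop is an assumption about where the jar should land. The "aws emr wait cluster-running" step and the "--query ClusterId --output text" options are additions that do not appear in the manual steps.

```shell
#!/bin/sh
# Sketch: wrap the manual EMR steps in functions. Nothing runs on load.
# Set AWS_CMD="echo aws" to print the commands instead of executing them.
AWS_CMD="${AWS_CMD:-aws}"
KEY_NAME="${KEY_NAME:-SparkClusterKeys}"            # EC2 key pair name
KEY_FILE="${KEY_FILE:-SparkClusterKeys.pem}"        # downloaded private key
PIRK_JAR="${PIRK_JAR:-apache-pirk-0.0.1-SNAPSHOT-exe.jar}"
# Assumption: the jar is copied into the hadoop user's home directory.
REMOTE_JAR="${REMOTE_JAR:-/home/hadoop/$(basename "$PIRK_JAR")}"

# Start a three-node Spark cluster and print only its cluster ID.
create_cluster() {
  $AWS_CMD emr create-cluster \
    --name "Spark Cluster" \
    --release-label emr-5.0.0 \
    --applications Name=Spark \
    --ec2-attributes KeyName="$KEY_NAME" \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --query ClusterId --output text
}

# Copy the Pirk jar to the master node of cluster $1.
upload_jar() {
  $AWS_CMD emr put --cluster-id "$1" --key-pair-file "$KEY_FILE" \
    --src "$PIRK_JAR" --dest "$REMOTE_JAR"
}

# Run the distributed test driver on the master node of cluster $1.
run_tests() {
  $AWS_CMD emr ssh --cluster-id "$1" --key-pair-file "$KEY_FILE" \
    --command "hadoop jar $REMOTE_JAR org.apache.pirk.test.distributed.DistributedTestDriver -j $REMOTE_JAR"
}

# Shut down cluster $1 so it stops accruing charges.
terminate_cluster() {
  $AWS_CMD emr terminate-clusters --cluster-ids "$1"
}

# Full cycle: create, wait, upload, test, and always terminate afterwards.
run_pirk_tests() {
  cluster_id="$(create_cluster)" || return 1
  $AWS_CMD emr wait cluster-running --cluster-id "$cluster_id"
  upload_jar "$cluster_id" && run_tests "$cluster_id"
  status=$?
  terminate_cluster "$cluster_id"
  return $status
}
```

Sourcing the file and calling run_pirk_tests would do the whole cycle. Terminating even when the tests fail is deliberate, given the hourly charges.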