Repository: incubator-pirk
Updated Branches:
  refs/heads/gh-pages 14dbf7ade -> 3af67efa4

PIRK-64: Added new Cloud Instructions page with details on how to use AWS, GCP, 
and (in the future) Azure with Pirk


Branch: refs/heads/gh-pages
Commit: e347cfb6a3575c70fc598413ca5e87bee451f738
Parents: 14dbf7a
Author: Jacob Wilder <>
Authored: Tue Sep 20 00:09:10 2016 -0400
Committer: Jacob Wilder <>
Committed: Tue Sep 20 00:11:18 2016 -0400

 _includes/nav.html    |   1 + | 143 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 144 insertions(+)
diff --git a/_includes/nav.html b/_includes/nav.html
index 84275f6..eab369f 100755
--- a/_includes/nav.html
+++ b/_includes/nav.html
@@ -17,6 +17,7 @@
         <ul class="dropdown-menu">
           <li id="nav_users"><a href="{{ site.baseurl }}/for_users">For 
           <li id="nav_developers"><a href="{{ site.baseurl 
}}/for_developers">For Developers</a></li>
+          <li id="nav_developers"><a href="{{ site.baseurl 
}}/cloud_instructions">Cloud instructions</a></li>
           <li id="nav_papers"><a href="{{ site.baseurl }}/papers">Papers &amp 
           <li class="nav_faq"><a href="{{ site.baseurl }}/faq">FAQ</a></li>
           <li class="divider"></li>
diff --git a/ b/
new file mode 100644
index 0000000..00340d4
--- /dev/null
+++ b/
@@ -0,0 +1,143 @@
+title: Running Pirk in Cloud Environments (GCP, AWS, Azure)
+nav: nav_commercial_cloud
+## Google Cloud Platform (GCP)
+This guide is a walkthrough with steps to get Pirk running on Google's Cloud 
+### Steps
+ 1. Create a Google Cloud account. You may be eligible for a 
first-60-days-or-$300 credit. You can do this from 
+ 2. Install the [`gcloud` command line tool]( 
and run `gcloud init`. This will let you authorize the `gcloud` tool against 
your Google account.
+ 3. Create a new project. For example, `pirkongcpexample`
+ 4. Enable billing for that project (somewhere in the user interface). If you 
are a free trial user you may not need to change anything in the billing 
+ 5. Enable the dataproc API. At the time of writing, [this page 
API]( was 
involved in the process of enabling dataproc API (if you have more than one 
project you may need to switch to the correct one using the picker on the page 
next to the Google logo). Ignore any warnings you see about a need to get 
+ 6. Spin up a cluster (replace $PROJECTNAME with the project name you used 
+`gcloud dataproc clusters create cluster-1 --zone us-east1-b 
--master-machine-type n1-standard-2 --master-boot-disk-size 150 --num-workers 3 
--worker-machine-type n1-highmem-2 --worker-boot-disk-size 25 --project 
`**`$PROJECTNAME`**` --properties 
+ 7. Once this completes run `gcloud compute config-ssh`. This adds entries to 
your `~/.ssh/config` which allow you to connect to your cluster nodes. To see 
the list look at your `~/.ssh/config` file. An example: 
+        Host
+               HostName  
+               IdentityFile /Users/pirkdev/.ssh/google_ed25519  
+               HostKeyAlias=compute.1295540156620891161  
+               IdentitiesOnly=yes  
+               CheckHostIP=no 
+    To SSH to this node I type `ssh -D 10010` (the `-m` indicates the master). (The 
`-D 10010` flag is optional and enables a SOCKS proxy you can configure a web 
browser with to see the [web 
+ 8. On GCP the default property `spark.home = /usr` is incorrect. Since 
`/root/` isn't accessible 
+ (and thus putting an additional properties file in 
`` isn't viable) one solution is to modify the
+ compiled-in ``
+ to have `` instead. (You'll want 
to change the `pirkdev` to your username on the node).
+ At `/home/pirkdev/share/` put a file containing 
+ 9. Transfer your compiled jar to the cluster: e.g. `scp 
+ 10. Run your jar. For example: 
+            hadoop jar apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar 
org.apache.pirk.test.distributed.DistributedTestDriver -j 
+            spark-submit --class 
$PWD/apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar -j 
$PWD/apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar -t 1:JS
+ 11. When you want to stop your cluster: 
+            [pirkdev:~] 2 % gcloud compute instances list
+            NAME           ZONE        MACHINE_TYPE    …
+            cluster-1-m    us-east1-b  n1-standard-2   …
+            cluster-1-w-0  us-east1-b  n1-highmem-2    …
+            cluster-1-w-1  us-east1-b  n1-highmem-2    …
+            cluster-1-w-2  us-east1-b  n1-highmem-2    …
+            [pirkdev:~] 2 % gcloud compute instances stop cluster-1-m  
cluster-1-w-0 cluster-1-w-1 cluster-1-w-2
+ Stop your instances to save money. [To quote 
+> A stopped instance does not incur charges, but all of the resources that are 
attached to 
+> the instance will still be charged. For example, you are charged for 
persistent disks and 
+> external IP addresses according to the price sheet, even if an instance is 
stopped. To stop 
+> being charged for attached resources, you can reconfigure a stopped instance 
to not use 
+> those resources, and then delete the resources.
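Step 11 above lists the instances and then retypes each name by hand. A minimal sketch of automating that follows, assuming every node name starts with the cluster name followed by a hyphen, as in the listing above. The function names `stop_command` and `list_cluster_instances` and the variables `CLUSTER`, `ZONE`, and `RUN_STOP` are hypothetical and not part of Pirk or `gcloud`.

```shell
#!/usr/bin/env bash
# Hypothetical helper for step 11: stop every instance of one Dataproc cluster.
# CLUSTER and ZONE default to the values used in the walkthrough.
CLUSTER="${CLUSTER:-cluster-1}"
ZONE="${ZONE:-us-east1-b}"

# Print (do not run) the stop command for the given zone and instance names.
stop_command() {
  local zone="$1"
  shift
  echo "gcloud compute instances stop $* --zone $zone"
}

# Print one instance name per line for names starting with "<cluster>-".
# Requires an authenticated gcloud installation.
list_cluster_instances() {
  gcloud compute instances list --filter="name~^$1-" --format="value(name)"
}

# Only contact GCP when RUN_STOP=1, so the file can be sourced or inspected safely.
if [ "${RUN_STOP:-0}" = "1" ]; then
  # Word splitting is intentional here: instance names contain no spaces.
  names=$(list_cluster_instances "$CLUSTER")
  if [ -n "$names" ]; then
    cmd=$(stop_command "$ZONE" $names)
    echo "Running: $cmd"
    $cmd
  else
    echo "No instances found for cluster $CLUSTER" >&2
  fi
fi
```

Running the script with `RUN_STOP=1` performs the stop; without it, nothing is sent to GCP. As the quoted pricing note says, persistent disks and external IP addresses are still billed while the instances are stopped.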
+## Microsoft Azure
+Right now Pirk can't be run on Microsoft's Azure HDInsight Hadoop platform 
because HDInsight only supports Java 7 and Pirk needs Java 8. Committer Jacob 
Wilder emailed a Microsoft engineer who works on HDInsight and heard that Java 8 
support is on Microsoft's roadmap for the end of September or October 2016. 
+### Steps that will likely eventually work
+These directions are based on the [basic cli 
 and the article [Create Linux-based clusters in HDInsight using the Azure 
+A note on HDInsight pricing: 
+> HDInsight clusters billing is pro-rated per minute, whether you are using 
them or not. Please be sure to delete your cluster after you have finished 
using it. For information on deleting a cluster, see [How to delete an 
+1. Create a Microsoft Azure account and either add billing information or get 
some credit for it. 
+2. If you haven't used Azure before then deploy [this 
 It will set up your account with licenses for the right resources. Don't 
forget to delete it after deploy. 
+3. Install the Azure CLI and run `azure login` to authenticate.
+4. Enter resource manager mode with `azure config mode arm`
+5. Pick a location from the location list (`azure location list`, e.g. 
`eastus`). The remainder of this tutorial uses `eastus`, but you can 
substitute another location. 
+6. Create the resource group for the cluster. This example uses the name 
`pirkcluster1`, but you can pick a different name: `azure group create 
`**`pirkcluster1`**` eastus`
+7. Create storage to use for the cluster `azure storage account create -g 
`**`pirkcluster1`**` --sku-name RAGRS -l eastus --kind Storage 
+8. Get one of the access keys for the storage account. `key1` is fine. 
+        % azure storage account keys list -g pirkcluster1 pirkstorage
+        info:    Executing command storage account keys list
+        + Getting storage account keys                                
+        data:    Name  Key                                Permissions
+        data:    ----  ----------------------------------------------
+        data:    key1  [a bunch of base64, save THIS]     Full  
+        data:    key2  another bunch of base 64           Full       
+        info:    storage account keys list command OK
+9. Register for the Azure HDInsight provider: `azure provider register 
+10. Create the cluster. Replace the bold faced values. You already have 
`key1_from_above_command`. Select your own ssh and http passwords. In this 
example **`pirkhdinsight1`** is the name that will be used to SSH into the 
cluster and manage it.  
+    `azure hdinsight cluster create -g pirkcluster1 -l eastus -y Linux 
--clusterType Hadoop --defaultStorageAccountName --defaultStorageAccountKey 
`**`key1_from_above_command`**` --defaultStorageContainer 
`**`pirkhdinsight1`**` --workerNodeCount 2 --userName admin --password 
`**`httppassword`**` --sshUserName sshuser --sshPassword `**`sshpassword`**` 
+11. This command takes about 15 minutes. Once it finishes you can log into 
your cluster using `ssh sshuser@`**`pirkhdinsight1`**`` 
where you've replaced **`pirkhdinsight1`** with the name of your cluster. 
+12. You may choose to install your ssh keys at this point using a command like 
`ssh-copy-id -i ~/.ssh/azure_ed25519 -o PubkeyAuthentication=no`
+13. At this point you can't do anything since HDInsight doesn't support Java 
8. Delete your cluster and wait for Azure HDInsight to support Java 8. 
+## Amazon Web Services EC2 EMR
+1. You'll need to have an AWS account with credit or billing information. 
+2. Either create a key pair within the AWS user interface or make one locally 
and upload the public key. Note the exact name of the key pair in the AWS 
interface because it is an argument to later commands. The key pair used in this 
tutorial is called `amazon_rsa` in the AWS user interface, and the private 
key is located on the local machine at `~/.ssh/amazon_rsa`.
+3. Install the [AWS CLI]( (probably using `pip 
install awscli`), then run `aws configure` and enter the Access Key ID and 
Secret Access Key associated with your account. 
+4. Run `aws emr create-default-roles`. 
+5. Before you can create a cluster you need to make a JSON file locally. Call 
it (for example) `aws-cluster-conf.json` with these contents:
+        [
+          {
+            "Classification": "yarn-site",
+            "Properties": {
+              "yarn.nodemanager.aux-services": "mapreduce_shuffle",
+              "yarn.nodemanager.aux-services.mapreduce_shuffle.class": 
+            }
+          }
+        ]
+    This configuration file fixes some YARN configuration options that (left 
+    prevent distributed Pirk from running. 
+6. Create the cluster:
+        aws emr create-cluster \
+           --name "Spark Cluster" \
+           --release-label emr-5.0.0 \
+           --applications Name=Spark \
+           --ec2-attributes KeyName=amazon_rsa \
+           --instance-type m3.xlarge \
+           --instance-count 3 \
+           --use-default-roles \
+           --configurations file://./aws-cluster-conf.json
+    Make note of the ClusterID it returns. For the remainder of these steps, 
assume that **`$cid`** has been set equal to the cluster ID (you may find it 
convenient to do this using `export cid=YOURCLUSTERID`)
+7. Wait for your cluster to be ready. You might find this command helpful: 
`watch -n 60 aws emr describe-cluster --output json --cluster-id `**`$cid`**
+8. Once your cluster is ready, go into the [AWS console in your 
browser]( and add a firewall rule 
enabling SSH access. Select the correct region in the upper corner, then click 
on Security Groups in the left hand column. Find the row with the Group Name 
"ElasticMapReduce-master", select the Inbound tab in the lower pane, click 
Edit, and then add a Rule for SSH (in the drop down menu) with Source "My IP" 
(change this to another value if desired).
+9. Upload the jar file (under the covers this runs `scp`):  
+`aws emr put --cluster-id `**`$cid`**` --key-pair-file 
`**`~/.ssh/amazon_rsa`**` --src apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar`
+10. SSH in using  
+`aws emr ssh --cluster-id `**`$cid`**` --key-pair-file 
+If you want to SSH in and set up a SOCKS proxy to access the [web 
 (like in the GCP instructions) copy the output of the SSH command and add the 
`-D $SOCKSPORTNUM` flag. The YARN resource manager is on port 8088 of the 
Master node. 
+11. Now on the cluster, you can run the distributed tests:  
+`hadoop jar apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar 
org.apache.pirk.test.distributed.DistributedTestDriver -j 
$PWD/apache-pirk-0.2.0-incubating-SNAPSHOT-exe.jar -t 1:J`
+12. When you are done working with your cluster, terminate it:  
+`aws emr terminate-clusters --cluster-ids `**`$cid`**
\ No newline at end of file
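
Step 7 of the AWS walkthrough watches `describe-cluster` output by hand. A minimal sketch of a polling loop that exits once the cluster is usable follows; it relies on the documented EMR cluster states (`STARTING`, `BOOTSTRAPPING`, `RUNNING`, `WAITING`, `TERMINATING`, `TERMINATED`, `TERMINATED_WITH_ERRORS`). The function names `classify_state` and `wait_for_cluster` and the `RUN_WAIT` variable are hypothetical; `$cid` is the cluster ID from step 6.

```shell
#!/usr/bin/env bash
# Hypothetical helper for step 7: block until an EMR cluster is ready or has failed.

# Map an EMR cluster state to one of: ready, pending, failed.
classify_state() {
  case "$1" in
    RUNNING|WAITING)        echo ready ;;
    STARTING|BOOTSTRAPPING) echo pending ;;
    *)                      echo failed ;;  # terminating, terminated, or unknown
  esac
}

# Poll the cluster state every $2 seconds (default 60) until it leaves "pending".
# Returns 0 when ready, 1 when the cluster failed, 2 when the aws call itself fails.
wait_for_cluster() {
  local cluster_id="$1" interval="${2:-60}" state
  while true; do
    state=$(aws emr describe-cluster --cluster-id "$cluster_id" \
              --query 'Cluster.Status.State' --output text) || return 2
    case "$(classify_state "$state")" in
      ready)  echo "Cluster $cluster_id is $state"; return 0 ;;
      failed) echo "Cluster $cluster_id is $state" >&2; return 1 ;;
    esac
    sleep "$interval"
  done
}

# Only contact AWS when RUN_WAIT=1, so the file can be sourced safely.
if [ "${RUN_WAIT:-0}" = "1" ]; then
  wait_for_cluster "$cid"
fi
```

The AWS CLI also ships a built-in waiter, `aws emr wait cluster-running --cluster-id $cid`, which may be enough if no custom handling of failed states is needed.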
