Hi Ted!

Thank you for your notes; your timing is perfect as I just now (finally) have enough time to start working on this in earnest.

I have started collecting some info - there's not much on Drill per se, but there are what seem to be analogous efforts in using Spark and R on EMR.

Apologies in advance for the pile of links, but I would appreciate any and all input as to whether a similar approach might work with Drill.

1.) The upshot is to use a bootstrap script to install software on cluster startup. In another project, I had success using R in Hadoop streaming (i.e., sans RHadoop), so I was thinking this might be a way to get the Drillbits onto the slave (core/task in AWS) nodes in the cluster. Then one would ssh to the master and subsequently connect to the core nodes.

Installing Apache Spark on an Amazon EMR Cluster
http://blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/Installing-Apache-Spark-on-an-Amazon-EMR-Cluster

...this next entry illustrates getting RHadoop onto the core nodes:

Statistical Analysis with Open-Source R and RStudio on Amazon EMR
http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR

Does that make sense?
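For concreteness, here's roughly what I have in mind for 1.) — a minimal bootstrap-action sketch. The tarball URL, Drill version, install path, and ZooKeeper host below are all placeholders I made up, and the ZooKeeper wiring would need to be confirmed (though if Drillbit discovery can be pointed at ZooKeeper, it might also sidestep the UDP multicast issue you mentioned):

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action: install and start a Drillbit on each node.
# Version, download URL, and ZooKeeper quorum are placeholders.
set -e

DRILL_VERSION="0.6.0-incubating"   # assumed version
DRILL_TGZ="apache-drill-${DRILL_VERSION}.tar.gz"

cd /home/hadoop
wget "http://archive.apache.org/dist/incubator/drill/drill-${DRILL_VERSION}/${DRILL_TGZ}"
tar -xzf "${DRILL_TGZ}"

# Point Drill at the cluster's ZooKeeper quorum (placeholder host) so the
# Drillbits can find each other without multicast.
cat > "apache-drill-${DRILL_VERSION}/conf/drill-override.conf" <<'EOF'
drill.exec: {
  cluster-id: "emr-drillbits",
  zk.connect: "zk-host:2181"
}
EOF

# Start the Drillbit daemon on this node.
"apache-drill-${DRILL_VERSION}/bin/drillbit.sh" start
```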

2.) I noticed that Amazon is now offering MapR instances in EMR; would there be any advantage to using one of these instead of the "stock" Amazon instances?

http://aws.amazon.com/elasticmapreduce/mapr/

3.) While I've only been reading and not (yet) directly experimenting, I'm not sure I understand your note on UDP multicast. Is there a link to the Drill wiki or documentation where this is explained (i.e., why this is pertinent)?

At the risk of redundancy, here's my slightly refined basic (hypothetical) plan:

i.) Create an EMR cluster with Drillbits installed on the core/task (slave) nodes via a bootstrap script

ii.) Load data onto S3 and create a Hive external table: have the data partitioned into "folders" (i.e., with S3 prefixes) and define corresponding Hive partitions on that layout. This would only be used to create the framework in the Hive metastore that Drill could leverage.
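As a sketch of step ii.), something like the following — the bucket name, table name, columns, and the date-based partition layout are all made up for illustration:

```shell
# Hypothetical: register an S3-backed external table in the Hive metastore,
# partitioned on a "dt" component of the S3 prefix. Bucket and schema are
# placeholders.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  id     STRING,
  value  DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/events/';

-- Attach one existing S3 'folder' as a partition.
ALTER TABLE events ADD PARTITION (dt='2014-10-01')
  LOCATION 's3://my-bucket/events/dt=2014-10-01/';
"
```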

iii.) Connect to the master node via SSH and then to a slave node (I guess via JDBC) to query Hive, à la:

https://cwiki.apache.org/confluence/display/DRILL/Querying+Hive
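i.e., after ssh-ing in, I imagine step iii.) looking something like this (assuming the Hive storage plugin is enabled in Drill, and reusing the hypothetical paths and names from above):

```shell
# Hypothetical: from a node running a Drillbit, open sqlline against the
# cluster via JDBC (ZooKeeper host is a placeholder) and query the
# Hive-registered table.
cd /home/hadoop/apache-drill-0.6.0-incubating
bin/sqlline -u "jdbc:drill:zk=zk-host:2181" <<'EOF'
USE hive;
SELECT dt, COUNT(*) FROM events GROUP BY dt;
EOF
```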

Does this seem to make sense?

:-)

Again, thank you for the notes!

Best,

Iver



On 10/31/2014 5:56 PM, Ted Dunning wrote:
Iver,

Didn't see if you got an answer.

Yes... you would just start drill bits on each node separately.  You might
have some troubles because Amazon disables UDP multicast by default.  That
issue should be resolved soon.



On Sun, Oct 19, 2014 at 4:30 PM, Iver Walkoe <[email protected]> wrote:

Hello!

This is my first query to the group - at present, I'm an inexperienced
Drill user though looking to change that.

I am pretty familiar with AWS - though not as much at the config level -
and can make my way around Hadoop.

That being said, and noting I'm going to be following up with Amazon
people on this as well, I thought I'd post a question here just in case
there were some readily available resources.

I'm looking to investigate the possibility of using Drill with Hive on an
EMR instance pointed toward an external table on S3. That is, I'd be
looking to use Hive to create the metadata for an external table on S3 and
have Drill leverage this.

In particular, I am pretty clueless as to how one would get Drill
installed on the slave nodes on an EMR instance. Don't know if it's
possible, in fact (hoping it is). It would seem that getting Drill (Bits)
on the slave nodes and then being able to communicate with a Drill Bit on
such a node is the task at hand.

Any and all suggestions are greatly appreciated!

Thanks!

Iver



