Hi Ted!

Thank you for your notes; your timing is perfect as I just now (finally) have enough time to start working on this in earnest.

I have started collecting some info - there's not much on Drill per se, but there are what seem to be analogous efforts in using Spark and R on EMR.

Apologies in advance for the pile of links, but I would appreciate any and all input as to whether a similar approach might work with Drill.

1.) The upshot is to use a bootstrap script to install software on cluster startup. In another project, I had success using R in Hadoop streaming (i.e., sans RHadoop), so I was thinking this might be a way to get the Drillbits onto the slave (core/task in AWS) nodes in the cluster. Then one would ssh to the master and subsequently connect to the core nodes.

Installing Apache Spark on an Amazon EMR Cluster
http://blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/Installing-Apache-Spark-on-an-Amazon-EMR-Cluster

...this next entry illustrates getting RHadoop onto the core nodes:

Statistical Analysis with Open-Source R and RStudio on Amazon EMR
http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR

Does that make sense?
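For concreteness, here's roughly what I have in mind for 1.) — a minimal bootstrap-action sketch. The tarball URL, Drill version, install path, and ZooKeeper host below are all placeholders I made up, and the ZooKeeper wiring would need to be confirmed (though if Drillbit discovery can be pointed at ZooKeeper, it might also sidestep the UDP multicast issue you mentioned):

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action: install and start a Drillbit on each node.
# Version, download URL, and ZooKeeper quorum are placeholders.
set -e

DRILL_VERSION="0.6.0-incubating"   # assumed version
DRILL_TGZ="apache-drill-${DRILL_VERSION}.tar.gz"

cd /home/hadoop
wget "http://archive.apache.org/dist/incubator/drill/drill-${DRILL_VERSION}/${DRILL_TGZ}"
tar -xzf "${DRILL_TGZ}"

# Point Drill at the cluster's ZooKeeper quorum (placeholder host) so the
# Drillbits can find each other without multicast.
cat > "apache-drill-${DRILL_VERSION}/conf/drill-override.conf" <<'EOF'
drill.exec: {
  cluster-id: "emr-drillbits",
  zk.connect: "zk-host:2181"
}
EOF

# Start the Drillbit daemon on this node.
"apache-drill-${DRILL_VERSION}/bin/drillbit.sh" start
```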

2.) I noticed that Amazon is now offering MapR instances in EMR; would there be any advantage to using one of these instead of the "stock" Amazon instances?

http://aws.amazon.com/elasticmapreduce/mapr/

3.) While I've only been reading and not (yet) directly experimenting, I'm not sure I understand your note on UDP multicast. Is there a link to the Drill wiki or documentation where this is explained (i.e., why this is pertinent)?

At the risk of redundancy, here's my slightly refined basic (hypothetical) plan:

i.) Create an EMR cluster with Drillbits installed on the core/task (slave) nodes via a bootstrap script

ii.) Load data onto S3 and create a Hive external table: have the data partitioned into "folders" (i.e., with S3 prefixes) and define corresponding Hive partitions on that layout. This would only be used to create the framework in the Hive metastore that Drill could leverage.
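As a sketch of step ii.), something like the following — the bucket name, table name, columns, and the date-based partition layout are all made up for illustration:

```shell
# Hypothetical: register an S3-backed external table in the Hive metastore,
# partitioned on a "dt" component of the S3 prefix. Bucket and schema are
# placeholders.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  id     STRING,
  value  DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/events/';

-- Attach one existing S3 'folder' as a partition.
ALTER TABLE events ADD PARTITION (dt='2014-10-01')
  LOCATION 's3://my-bucket/events/dt=2014-10-01/';
"
```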

iii.) Connect to the master node via SSH and then to a slave node (I guess via JDBC) to query Hive, à la:

https://cwiki.apache.org/confluence/display/DRILL/Querying+Hive
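i.e., after ssh-ing in, I imagine step iii.) looking something like this (assuming the Hive storage plugin is enabled in Drill, and reusing the hypothetical paths and names from above):

```shell
# Hypothetical: from a node running a Drillbit, open sqlline against the
# cluster via JDBC (ZooKeeper host is a placeholder) and query the
# Hive-registered table.
cd /home/hadoop/apache-drill-0.6.0-incubating
bin/sqlline -u "jdbc:drill:zk=zk-host:2181" <<'EOF'
USE hive;
SELECT dt, COUNT(*) FROM events GROUP BY dt;
EOF
```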

Does this seem to make sense?

:-)

Again, thank you for the notes!

Best,

Iver



On 10/31/2014 5:56 PM, Ted Dunning wrote:
Iver,

Didn't see if you got an answer.

Yes... you would just start drill bits on each node separately.  You might
have some troubles because Amazon disables UDP multicast by default.  That
issue should be resolved soon.



On Sun, Oct 19, 2014 at 4:30 PM, Iver Walkoe <[email protected]> wrote:

Hello!

This is my first query to the group - at present, I'm an inexperienced
Drill user though looking to change that.

I am pretty familiar with AWS - though not as much at the config level -
and can make my way around Hadoop.

That being said, and noting I'm going to be following up with Amazon
people on this as well, I thought I'd post a question here just in case
there were some readily available resources.

I'm looking to investigate the possibility of using Drill with Hive on an
EMR instance pointed toward an external table on S3. That is, I'd be
looking to use Hive to create the metadata for an external table on S3 and
have Drill leverage this.

In particular, I am pretty clueless as to how one would get Drill
installed on the slave nodes on an EMR instance. Don't know if it's
possible, in fact (hoping it is). It would seem that getting Drill (Bits)
on the slave nodes and then being able to communicate with a Drill Bit on
such a node is the task at hand.

Any and all suggestions are greatly appreciated!

Thanks!

Iver



