[
https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745809#action_12745809
]
Jeff Hodges commented on CASSANDRA-342:
---------------------------------------
Okay, before we talk about the boot code, let me address some of the
confusion about Hadoop.
In Hadoop, there are things called Jobs: a combination of a Map
operation, a Reduce operation, and the InputFormat configuration you
specify, which is then run across a bunch of machines.
A Task is an individual Map or Reduce operation run on one of those
machines (so every Job has many Tasks). For every new Task needed, a
new JVM is booted up.[1]
This is actually okay, distributed-systems-wise, because it keeps all
the Tasks from interfering with one another.
It does, however, make our jobs harder. There is no way for a Task
(and thus this Hadoop code in these patches) to access the runtime of
a Cassandra node already on the machine because they will be in
separate JVMs!
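To make the Job/Task split concrete, here is a minimal sketch of wiring up a Job with the (2009-era) mapred API. JobConf, TextInputFormat, and JobClient are real Hadoop classes; WordCountMapper and WordCountReducer are hypothetical user classes standing in for the Map and Reduce halves:

```java
// Sketch only: assumes Hadoop's old org.apache.hadoop.mapred API on the
// classpath, plus hypothetical WordCountMapper/WordCountReducer classes.
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountJob.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(WordCountMapper.class);   // the Map half of the Job
        conf.setReducerClass(WordCountReducer.class); // the Reduce half
        conf.setInputFormat(TextInputFormat.class);   // controls how input becomes InputSplits/Tasks
        JobClient.runJob(conf); // every resulting Task is run in its own child JVM
    }
}
```

The point is that nothing in this configuration ever runs in the same JVM as a long-lived daemon on the worker machine; each Task's child JVM starts cold.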
HBase, as I mentioned above, solves this problem by first starting up
HBase on those remote machines, and then having each Task create an
HTable object from the InputSplit handed to it. This HTable object
connects to the local HBase process. (Of course, this same thing
happens in the JVM that creates the InputSplits.)
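For reference, the inside-a-Task half of the HBase pattern looks roughly like this (HBaseConfiguration, HTable, and TableSplit are real HBase classes of that era; the record-reader scaffolding around this fragment is elided):

```java
// Sketch of the HBase approach: the Task casts its InputSplit to a
// TableSplit and builds an HTable from it. The HTable client then talks
// to the HBase process that was started on the machine *before* the Job ran,
// so the Task never has to boot its own copy.
HBaseConfiguration conf = new HBaseConfiguration();
TableSplit split = (TableSplit) genericSplit;          // the InputSplit handed to this Task
HTable table = new HTable(conf, split.getTableName()); // connects to the local HBase process
```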
So, here's my deal. There is no way for the system as currently
designed to work efficiently in a distributed setting, because we have
to boot a brand new Cassandra process on machines that might already
have one running (and need it, if hardware is limited). The boot-up
time for Cassandra alone is a big time sink. And consider how these
nodes would interoperate with the "stable", non-Hadoop nodes that
would start sending them data. Ugh.
We can avoid all of this boot-time drama if we can come up with a
good way of remotely accessing all of the internal information we need
from the Cassandra node that is already running. So far, I have not
been able to come up with a way to do that, or with an alternative.
Comments?
[1] There is something called "JVM reuse" that can be configured into
a Hadoop deployment. However, the "reuse" only means that a child JVM
can run more than one Task from the same Job. So, it's basically just
another complicating factor in our boot loading code (one of the
reasons there is BootUp.boot() and BootUp.bootUnsafe()) but doesn't
help us with our problem.
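For the record, the knob in question is set per Job; setNumTasksToExecutePerJvm is the JobConf accessor for it, and -1 means unlimited reuse. A sketch:

```java
// JVM reuse only spans Tasks belonging to one Job; it never keeps a JVM
// alive across Jobs, so it doesn't give us a long-lived process we could
// piggyback a Cassandra runtime on.
JobConf conf = new JobConf();
conf.setNumTasksToExecutePerJvm(-1); // -1 = reuse the child JVM for any number of this Job's Tasks
```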
> hadoop integration
> ------------------
>
> Key: CASSANDRA-342
> URL: https://issues.apache.org/jira/browse/CASSANDRA-342
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch,
> 0001-the-stupid-version-of-hadoop-support.patch,
> 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch,
> 0002-CASSANDRA-342.-Working-hadoop-support.patch,
> 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch,
> 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch,
> 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch,
> v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev:
> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%[email protected]%3e
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.