[ https://issues.apache.org/jira/browse/CASSANDRA-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745809#action_12745809 ]

Jeff Hodges edited comment on CASSANDRA-342 at 8/20/09 10:55 PM:
-----------------------------------------------------------------


Okay, before we talk about the boot code, let me address some of the confusion 
about Hadoop.

In Hadoop, there are things called Jobs: a Job is the combination of a Map 
operation, a Reduce operation, and the InputFormat configuration you specify, 
and it is run across a bunch of machines.
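To make the Job/Task split concrete, here is a pseudocode-style sketch of how a Job is wired up against the old (circa-0.20) mapred API. It is not compilable standalone (it assumes the Hadoop classes on the classpath), and the WordCount* class names are hypothetical:

```java
// Sketch only: old org.apache.hadoop.mapred API, circa Hadoop 0.20.
// WordCountMap / WordCountReduce are hypothetical user-supplied classes.
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(WordCountMap.class);       // the Map operation
conf.setReducerClass(WordCountReduce.class);   // the Reduce operation
conf.setInputFormat(TextInputFormat.class);    // how input is split and read
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);  // submit; the framework fans this out as Tasks
```

Everything above `runJob` is the Job definition; the Tasks are what the framework spawns per machine to execute it.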

A Task is an individual Map or Reduce operation run on one of those machines 
(so every Job has many Tasks). For every new Task needed, a new JVM is booted 
up.[1]

This is actually okay, distributed-systems-wise, because it keeps all the Tasks 
from interfering with one another.

It does, however, make our job harder. There is no way for a Task (and thus 
for the Hadoop code in these patches) to access the runtime of a Cassandra 
node already on the machine, because they will be in separate JVMs!

HBase, as I mentioned above, solves this problem by first starting up HBase on 
those remote machines, and then having each Task create an HTable object from 
the InputSplit handed to it. This HTable object connects to the local HBase 
process. (Of course, this same thing happens in the JVM that creates the 
InputSplits.)
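For reference, the HBase pattern looks roughly like the following pseudocode-style sketch. The names are from memory of the 2009-era HBase client API and may not match exactly; the point is only where the client handle gets constructed:

```java
// Sketch only: not compilable standalone, assumes 2009-era HBase client jars.
// Each Task builds its own client handle (HTable) from the InputSplit it was
// handed; the handle connects to the HBase process already running locally,
// rather than booting a fresh one inside the Task's JVM.
public void init(TableSplit split, HBaseConfiguration conf) throws IOException {
  HTable table = new HTable(conf, split.getTableName());  // client, not server
  Scanner scanner = table.getScanner(columns, split.getStartRow());
  // ... feed rows from scanner into the Map operation ...
}
```

That is exactly the shape we cannot reproduce today: there is no equivalent client handle that reaches into an already-running Cassandra node's internals.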

So, here's my deal. There is no way for the system as currently designed to 
work efficiently in a distributed setting, because we have to boot a brand-new 
Cassandra process on machines that may already have one running (and may need 
it, if hardware is limited). The boot-up time for Cassandra alone is a big 
time sink. And consider how these nodes would interoperate with the "stable", 
non-Hadoop nodes that would start sending them data. Ugh.

We can avoid all of this boot-time drama if we can come up with a good way of 
remotely accessing all of the internal information we need from the Cassandra 
node that is already running. I have not been able to come up with an 
alternative solution.

Comments?

[1] There is something called JVM "reuse" that can be configured into a Hadoop 
deployment. However, the "reuse" only means that a JVM can be reused by 
multiple Tasks of the same Job. So, it's basically just another complicating 
factor in our boot-loading code (one of the reasons there is BootUp.boot() and 
BootUp.bootUnsafe()) but doesn't help us with our problem.
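The configuration knob behind this "reuse" is, if memory serves, the mapred.job.reuse.jvm.num.tasks property. Note that even its unlimited setting only spans Tasks of a single Job:

```xml
<!-- hadoop-site.xml: JVM reuse only spans Tasks of the SAME Job -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value> <!-- -1 = unlimited reuse within one Job; default is 1 -->
</property>
```

So a long-lived, cross-Job Cassandra runtime still cannot be hosted this way.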

> hadoop integration
> ------------------
>
>                 Key: CASSANDRA-342
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-342
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>         Attachments: 0001-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 
> 0001-the-stupid-version-of-hadoop-support.patch, 
> 0001-v3-CASSANDRA-342.-Set-up-for-the-hadoop-commits.patch, 
> 0002-CASSANDRA-342.-Working-hadoop-support.patch, 
> 0002-v3-CASSANDRA-342.-Working-hadoop-support.patch, 
> 0003-CASSANDRA-342.-Adding-the-WordCount-example.patch, 
> 0003-v3-CASSANDRA-342.-Adding-the-WordCount-example.patch, 
> v2-squashed-commits-for-hadoop-stupid.patch
>
>
> Some discussion on -dev: 
> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%[email protected]%3e
