Hey,

First off, I think integrating Infinispan with Hadoop through the InputFormat and OutputFormat interfaces, rather than through the file system interfaces, is a really good idea. The file-system approach is the one Gluster uses, but I don't think it's ideal for Infinispan.
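For the archives, here is a rough sketch of what the InputFormat side of this could look like. To be clear, none of the class names below are an existing Infinispan or proposed API — a real implementation would extend org.apache.hadoop.mapreduce.InputFormat, InputSplit and RecordReader. To keep the example self-contained I'm modeling the cache as a plain Map partitioned into hash segments, with one "split" per segment, which is the property that lets Hadoop schedule each Map task next to the node owning that segment:

```java
import java.util.*;

// Hypothetical sketch, NOT an Infinispan API: one "split" per cache
// segment so each Map task can be scheduled on the node that owns the
// segment. A real implementation would extend
// org.apache.hadoop.mapreduce.InputFormat / InputSplit / RecordReader;
// here the cache is just a Map partitioned by key hash.
public class InfinispanSplitSketch {

    static final int SEGMENTS = 4;

    // stand-in for Infinispan's consistent-hash segment function
    static int segmentOf(String key) {
        return Math.floorMod(key.hashCode(), SEGMENTS);
    }

    // getSplits(): one split per segment, carrying the keys that segment owns
    static List<List<String>> getSplits(Map<String, String> cache) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < SEGMENTS; i++) splits.add(new ArrayList<>());
        for (String key : cache.keySet()) splits.get(segmentOf(key)).add(key);
        return splits;
    }

    // a RecordReader for one split iterates only that segment's entries
    static List<String> readSplit(Map<String, String> cache, List<String> split) {
        List<String> records = new ArrayList<>();
        for (String key : split) records.add(key + "=" + cache.get(key));
        return records;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("a", "1");
        cache.put("b", "2");
        cache.put("c", "3");
        int total = 0;
        for (List<String> split : getSplits(cache)) {
            total += readSplit(cache, split).size();
        }
        // every entry is read exactly once across all splits
        System.out.println(total == cache.size());
    }
}
```

The point of the sketch is just that splits fall out of the cache's own partitioning, so there is no file-system layer to fake.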
> So to clarify, we will have a cluster of nodes where each node contains two
> JVM, one running an Hadoop process, one running an Infinispan process. The
> Hadoop process would only read the data from the Infinispan process in the
> same node during a normal M/R execution.

A regular non-master Hadoop node runs a DataNode daemon and a TaskTracker daemon in separate JVMs, and each Map/Reduce task is also executed in its own separate JVM. In Hadoop 2.0, the TaskTracker daemon is replaced by a NodeManager daemon. That is the minimal number of JVM processes on a node; there will be more if other services are running in the cluster.

So the first point I am trying to make is that there are many different JVM processes running on a Hadoop node. Given that, which JVM process are you referring to when you say the "Hadoop process"? I *think* this means the actual Map/Reduce task JVM, but I'm not sure.

I would also say that figuring out how to integrate Infinispan with Apache Spark [1] would be interesting. I don't know as much about it, but a lot of processing currently done in Map/Reduce is expected to migrate to Spark; it's certainly being billed as the replacement for, or at least the future of, Map/Reduce.
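One tangential note on the per-task JVMs: in Hadoop 1.x the task-JVM lifecycle is at least partly tunable via configuration — to the best of my recollection the property below controls it, but please double-check the name against your version's mapred-default.xml, and note that MR2/YARN dropped JVM reuse in favour of other mechanisms (e.g. uber tasks):

```xml
<!-- mapred-site.xml, Hadoop 1.x (verify against your version) -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- 1 = fresh JVM per task (the default); -1 = reuse one JVM
       for an unlimited number of a job's tasks on the same node -->
  <value>1</value>
</property>
```

So the "one JVM per task" picture is the default, not a hard constraint — which matters when counting how many processes the Infinispan integration would actually sit next to.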
Thanks,
Alan

[1] https://spark.apache.org/

----- Original Message -----
> From: "Mircea Markus" <[email protected]>
> To: "infinispan -Dev List" <[email protected]>
> Sent: Friday, March 14, 2014 11:26:03 AM
> Subject: Re: [infinispan-dev] Infinispan - Hadoop integration
>
> On Mar 14, 2014, at 9:06, Emmanuel Bernard <[email protected]> wrote:
>
> >> On 13 mars 2014, at 23:39, Sanne Grinovero <[email protected]> wrote:
> >>
> >>> On 13 March 2014 22:19, Mircea Markus <[email protected]> wrote:
> >>>
> >>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <[email protected]> wrote:
> >>>>
> >>>>> On 13 March 2014 22:05, Mircea Markus <[email protected]> wrote:
> >>>>>
> >>>>> On Mar 13, 2014, at 20:59, Ales Justin <[email protected]> wrote:
> >>>>>
> >>>>>>> - also important to notice that we will have both an Hadoop and an
> >>>>>>> Infinispan cluster running in parallel: the user will interact with
> >>>>>>> the former in order to run M/R tasks. Hadoop will use Infinispan
> >>>>>>> (integration achieved through InputFormat and OutputFormat) in
> >>>>>>> order to get the data to be processed.
> >>>>>>
> >>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as
> >>>>>> well -- hence 1 JVM?
> >>>>>
> >>>>> good point, ideally it should be a single VM: reduced serialization
> >>>>> cost (in vm access) and simpler architecture. That's if you're not
> >>>>> using C/S mode, of course.
> >>>>
> >>>> ?
> >>>> Don't try confusing us again on that :-)
> >>>> I think we agreed that the job would *always* run in strict locality
> >>>> with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client
> >>>> would be connecting from somewhere else but that's unrelated.
> >>>
> >>> we did discuss the possibility of running it over hotrod though, do you
> >>> see a problem with that?
> >>
> >> No of course not, we discussed that. I just mean I think that needs to
> >> be clarified on the list that the Hadoop engine will always run in the
> >> same JVM.
> >> Clients (be it Hot Rod via new custom commands or Hadoop
> >> native clients, or Hadoop clients over Hot Rod) can indeed connect
> >> remotely, but it's important to clarify that the processing itself
> >> will take advantage of locality in all configurations. In other words,
> >> to clarify that the serialization cost you mention for clients is just
> >> to transfer the job definition and optionally the final processing
> >> result.
> >>
> >
> > Not quite. The serialization cost Mircea mentions I think is between the
> > Hadoop vm and the Infinispan vm on a single node. The serialization does
> > not require network traffic but is still shuffling data between two
> > processes basically. We could eliminate this by starting both Hadoop and
> > Infinispan from the same VM but that requires more work than necessary for
> > a prototype.
>
> thanks for the clarification, indeed this is the serialization overhead I had
> in mind.
>
> >
> > So to clarify, we will have a cluster of nodes where each node contains two
> > JVM, one running an Hadoop process, one running an Infinispan process. The
> > Hadoop process would only read the data from the Infinispan process in the
> > same node during a normal M/R execution.
>
> Cheers,
> --
> Mircea Markus
> Infinispan lead (www.infinispan.org)
>
> _______________________________________________
> infinispan-dev mailing list
> [email protected]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
