Hey,

First off, I think integrating Infinispan with Hadoop through the InputFormat and OutputFormat interfaces, rather than through the file system interfaces, is a really good idea. The file-system approach is the one Gluster uses, but I don't think it's ideal for Infinispan.
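For the archives, here is a rough sketch of what the InputFormat side of this could look like. To be clear, none of the class names below are an existing Infinispan or proposed API — a real implementation would extend org.apache.hadoop.mapreduce.InputFormat, InputSplit and RecordReader. To keep the example self-contained I'm modeling the cache as a plain Map partitioned into hash segments, with one "split" per segment, which is the property that lets Hadoop schedule each Map task next to the node owning that segment:

```java
import java.util.*;

// Hypothetical sketch, NOT an Infinispan API: one "split" per cache
// segment so each Map task can be scheduled on the node that owns the
// segment. A real implementation would extend
// org.apache.hadoop.mapreduce.InputFormat / InputSplit / RecordReader;
// here the cache is just a Map partitioned by key hash.
public class InfinispanSplitSketch {

    static final int SEGMENTS = 4;

    // stand-in for Infinispan's consistent-hash segment function
    static int segmentOf(String key) {
        return Math.floorMod(key.hashCode(), SEGMENTS);
    }

    // getSplits(): one split per segment, carrying the keys that segment owns
    static List<List<String>> getSplits(Map<String, String> cache) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < SEGMENTS; i++) splits.add(new ArrayList<>());
        for (String key : cache.keySet()) splits.get(segmentOf(key)).add(key);
        return splits;
    }

    // a RecordReader for one split iterates only that segment's entries
    static List<String> readSplit(Map<String, String> cache, List<String> split) {
        List<String> records = new ArrayList<>();
        for (String key : split) records.add(key + "=" + cache.get(key));
        return records;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("a", "1");
        cache.put("b", "2");
        cache.put("c", "3");
        int total = 0;
        for (List<String> split : getSplits(cache)) {
            total += readSplit(cache, split).size();
        }
        // every entry is read exactly once across all splits
        System.out.println(total == cache.size());
    }
}
```

The point of the sketch is just that splits fall out of the cache's own partitioning, so there is no file-system layer to fake.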
> So to clarify, we will have a cluster of nodes where each node contains two
> JVM, one running an Hadoop process, one running an Infinispan process. The
> Hadoop process would only read the data from the Infinispan process in the
> same node during a normal M/R execution.

A regular non-master Hadoop node runs a DataNode daemon and a TaskTracker daemon in separate JVMs, and each Map/Reduce task is also executed in its own separate JVM. In Hadoop 2.0, the TaskTracker daemon is replaced by a NodeManager daemon. That is the minimal number of JVM processes on a node; there will be more if other services are running in the cluster.

So the first point I am trying to make is that there are many different JVM processes running on a Hadoop node. Given that, which JVM process are you referring to when you say the "Hadoop process"? I *think* this means the actual Map/Reduce task JVM, but I'm not sure.

I would also say that figuring out how to integrate Infinispan with Apache Spark [1] would be interesting. I don't know as much about it, but a lot of processing currently done in Map/Reduce is expected to migrate to Spark; it's certainly being billed as the replacement for, or at least the future of, Map/Reduce.
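One tangential note on the per-task JVMs: in Hadoop 1.x the task-JVM lifecycle is at least partly tunable via configuration — to the best of my recollection the property below controls it, but please double-check the name against your version's mapred-default.xml, and note that MR2/YARN dropped JVM reuse in favour of other mechanisms (e.g. uber tasks):

```xml
<!-- mapred-site.xml, Hadoop 1.x (verify against your version) -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- 1 = fresh JVM per task (the default); -1 = reuse one JVM
       for an unlimited number of a job's tasks on the same node -->
  <value>1</value>
</property>
```

So the "one JVM per task" picture is the default, not a hard constraint — which matters when counting how many processes the Infinispan integration would actually sit next to.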
Thanks,
Alan

[1] https://spark.apache.org/

----- Original Message -----
> From: "Mircea Markus" <[email protected]>
> To: "infinispan -Dev List" <[email protected]>
> Sent: Friday, March 14, 2014 11:26:03 AM
> Subject: Re: [infinispan-dev] Infinispan - Hadoop integration
>
> On Mar 14, 2014, at 9:06, Emmanuel Bernard <[email protected]> wrote:
>
> >> On 13 mars 2014, at 23:39, Sanne Grinovero <[email protected]> wrote:
> >>
> >>> On 13 March 2014 22:19, Mircea Markus <[email protected]> wrote:
> >>>
> >>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <[email protected]> wrote:
> >>>>
> >>>>> On 13 March 2014 22:05, Mircea Markus <[email protected]> wrote:
> >>>>>
> >>>>> On Mar 13, 2014, at 20:59, Ales Justin <[email protected]> wrote:
> >>>>>
> >>>>>>> - also important to notice that we will have both an Hadoop and an
> >>>>>>> Infinispan cluster running in parallel: the user will interact with
> >>>>>>> the former in order to run M/R tasks. Hadoop will use Infinispan
> >>>>>>> (integration achieved through InputFormat and OutputFormat) in
> >>>>>>> order to get the data to be processed.
> >>>>>>
> >>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as
> >>>>>> well -- hence 1 JVM?
> >>>>>
> >>>>> good point, ideally it should be a single VM: reduced serialization
> >>>>> cost (in vm access) and simpler architecture. That's if you're not
> >>>>> using C/S mode, of course.
> >>>>
> >>>> ?
> >>>> Don't try confusing us again on that :-)
> >>>> I think we agreed that the job would *always* run in strict locality
> >>>> with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client
> >>>> would be connecting from somewhere else but that's unrelated.
> >>>
> >>> we did discuss the possibility of running it over hotrod though, do you
> >>> see a problem with that?
> >>
> >> No of course not, we discussed that. I just mean I think that needs to
> >> be clarified on the list that the Hadoop engine will always run in the
> >> same JVM.
> >> Clients (be it Hot Rod via new custom commands or Hadoop
> >> native clients, or Hadoop clients over Hot Rod) can indeed connect
> >> remotely, but it's important to clarify that the processing itself
> >> will take advantage of locality in all configurations. In other words,
> >> to clarify that the serialization cost you mention for clients is just
> >> to transfer the job definition and optionally the final processing
> >> result.
> >>
> >
> > Not quite. The serialization cost Mircea mentions I think is between the
> > Hadoop vm and the Infinispan vm on a single node. The serialization does
> > not require network traffic but is still shuffling data between two
> > processes basically. We could eliminate this by starting both Hadoop and
> > Infinispan from the same VM but that requires more work than necessary for
> > a prototype.
>
> thanks for the clarification, indeed this is the serialization overhead I had
> in mind.
>
> >
> > So to clarify, we will have a cluster of nodes where each node contains two
> > JVM, one running an Hadoop process, one running an Infinispan process. The
> > Hadoop process would only read the data from the Infinispan process in the
> > same node during a normal M/R execution.
>
> Cheers,
> --
> Mircea Markus
> Infinispan lead (www.infinispan.org)
>
> _______________________________________________
> infinispan-dev mailing list
> [email protected]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
