[infinispan-dev] Hadoop and ISPN first and next steps

Gustavo Fernandes Mon, 23 Jun 2014 02:05:27 -0700

Hi all,

Last week Pedro, Mircea and I met in London to start prototyping the 
integration between Hadoop and ISPN.


We discussed several scenarios where Hadoop and ISPN would be able to work 
together, and decided to start with ISPN server as the source and/or sink for a 
Hadoop Map Reduce job.

After creating an InputFormat and OutputFormat for ISPN [1], we generated some 
data [2] and ran a sample job [3] using Hadoop v1.x, both in Docker [4] and on 
a 4-node physical cluster (installed with the help of Puppet [5]).

We also ran the same job on the same cluster with the same data, but using HDFS 
as data source and sink, so that we could verify correctness.

In this setup, each Hadoop slave runs the TaskTracker, DataNode and ISPN 
server, and the idea was to generate a split [6] based on segments and redirect 
the map task to be executed on the nodes associated with those segments. The 
routing and data filtering are still work in progress, carried out by Pedro.
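
To make the split idea concrete, here is a minimal sketch in plain Python (it 
is not the prototype code in [1]) of grouping segments by owning node, so that 
each split carries the host where its data lives. The segment-to-owner table, 
the host names and the SegmentSplit type are hypothetical; a real InputFormat 
would take ownership from the ISPN topology and expose the host through 
InputSplit.getLocations() [6].

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SegmentSplit:
    """Hypothetical split: a group of segments plus the host that owns them."""
    host: str
    segments: Tuple[int, ...]


def splits_by_owner(segment_owners: Dict[int, str]) -> List[SegmentSplit]:
    """Group segment ids by their (assumed) primary owner, one split per host."""
    grouped = defaultdict(list)
    for segment, host in segment_owners.items():
        grouped[host].append(segment)
    # Sorted for deterministic output; the host doubles as the locality hint.
    return [SegmentSplit(host, tuple(sorted(segs)))
            for host, segs in sorted(grouped.items())]


# Assumed ownership table: segment id -> primary owner (made-up host names).
owners = {0: "node1", 1: "node2", 2: "node1", 3: "node3", 4: "node2"}
splits = splits_by_owner(owners)
for split in splits:
    print(split.host, split.segments)
```

With one split per host, the scheduler can place each map task next to the 
ISPN server holding those segments; filtering the entries by segment on the 
server side is the part that is still open.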

Next steps? 

- Optimise the current Input/OutputFormat so that it can read and write data 
efficiently. This will allow ISPN to become part of the Hadoop ecosystem and 
make it easier to integrate with tools like Apache Hive [7] or Pig [8].
- Investigate closer integration for Map Reduce, potentially usable in library 
mode. As you might know, YARN (the overhaul of Hadoop architecture) is not only 
about Map Reduce, and it offers more extension points than Hadoop Map Reduce v1.
- I read the Spark paper [9] with great interest. Spark provides a DSL with 
functional language constructs like map, flatMap and filter to process 
distributed data in memory. In this scenario, Map Reduce is just a special case 
achieved by chaining functions [10]. As Spark is much more than Map Reduce, and 
can run many machine learning algorithms efficiently, I was wondering if we 
should shift attention to Spark rather than focusing too much on Map Reduce. 
Thoughts?
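
To illustrate the point about chaining, here is a minimal word count in plain 
Python (deliberately not the Spark API) built from a flatMap step, a map step 
and a reduce-by-key step. The reduce_by_key helper and the sample lines are 
made up for this sketch; Spark's own operations are described in [10].

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

lines = ["to be or not to be", "to do is to be"]

# flatMap: each line expands into many words.
words = chain.from_iterable(line.split() for line in lines)

# map: each word becomes a (key, value) pair.
pairs = ((word, 1) for word in words)


def reduce_by_key(pairs, fn):
    """Hypothetical helper: group values by key, then fold each group with fn."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}


# reduce: sum the values collected for each key.
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

The map and reduce phases of a classic job show up here as two links in a 
longer chain, which is the sense in which Map Reduce is a special case.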


[1] 
https://github.com/pruivo/infinispan-hadoop-integration/tree/master/src/main/java/org/infinispan/hadoopintegration/mapreduce
[2] 
http://www.skorks.com/2010/03/how-to-quickly-generate-a-large-file-on-the-command-line-with-linux/
[3] 
https://github.com/pruivo/hadoop-wordcount-example/tree/master/src/main/java/com/gustavonalle/hadoop
[4] https://github.com/gustavonalle/docker/tree/master/hadoop
[5] https://gist.github.com/gustavonalle/95dfdd771f31e1e2bf9d
[6] 
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputSplit.html
[7] https://hive.apache.org/
[8] http://pig.apache.org/
[9] http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
[10] https://spark.apache.org/docs/0.9.0/quick-start.html#more-on-rdd-operations
   
Cheers,
Gustavo
