I think a prerequisite for implementing FlumeJava is to improve JobControl so that it supports DAGs of Hadoop jobs, letting independent jobs execute in parallel. It also needs to be enriched with intermediate data management.
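
To make the DAG idea concrete, here is a minimal sketch in plain Java of running jobs in dependency order while letting independent jobs run concurrently. It does not use the Hadoop JobControl API. The class JobDag, its add() and run() methods, and the job names are all hypothetical, and each "job" is just a Runnable standing in for a real MapReduce job. Intermediate data management is left out.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical stand-in for a DAG-aware job controller; not a Hadoop class.
public class JobDag {
    private final Map<String, Runnable> bodies = new LinkedHashMap<>();
    private final Map<String, List<String>> deps = new LinkedHashMap<>();

    // Registers a job together with the names of the jobs it depends on.
    public JobDag add(String name, Runnable body, String... dependsOn) {
        bodies.put(name, body);
        deps.put(name, Arrays.asList(dependsOn));
        return this;
    }

    // Starts each job once all of its dependencies have finished and
    // returns the job names in the order in which they completed.
    public List<String> run(ExecutorService pool) {
        Map<String, CompletableFuture<Void>> futures = new HashMap<>();
        List<String> finished = Collections.synchronizedList(new ArrayList<>());
        for (String name : bodies.keySet()) {
            schedule(name, pool, futures, finished, new HashSet<>());
        }
        CompletableFuture.allOf(futures.values().toArray(new CompletableFuture[0])).join();
        return finished;
    }

    private CompletableFuture<Void> schedule(String name, ExecutorService pool,
            Map<String, CompletableFuture<Void>> futures, List<String> finished,
            Set<String> visiting) {
        CompletableFuture<Void> existing = futures.get(name);
        if (existing != null) {
            return existing;
        }
        if (!bodies.containsKey(name)) {
            throw new IllegalArgumentException("unknown job: " + name);
        }
        if (!visiting.add(name)) {
            throw new IllegalStateException("cycle at job: " + name);
        }
        List<CompletableFuture<Void>> upstream = new ArrayList<>();
        for (String dep : deps.get(name)) {
            upstream.add(schedule(dep, pool, futures, finished, visiting));
        }
        visiting.remove(name);
        // The job body is submitted to the pool only after every upstream job completes.
        CompletableFuture<Void> future = CompletableFuture
                .allOf(upstream.toArray(new CompletableFuture[0]))
                .thenRunAsync(() -> {
                    bodies.get(name).run();
                    finished.add(name);
                }, pool);
        futures.put(name, future);
        return future;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<String> order = new JobDag()
                    .add("count-a", () -> System.out.println("running count-a"))
                    .add("count-b", () -> System.out.println("running count-b"))
                    .add("join", () -> System.out.println("running join"), "count-a", "count-b")
                    .run(pool);
            System.out.println("completion order: " + order);
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch count-a and count-b have no dependencies, so they may run at the same time, while join starts only after both have finished. A real implementation would submit Hadoop jobs instead of Runnables and would also have to track and clean up the intermediate outputs passed between them.
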
A simpler alternative would be to implement FlumeJava on top of Oozie. Ideally, FlumeJava should be a Pig backend.

----- Original Message -----
From: Jeff Hammerbacher (JIRA) <[email protected]>
To: [email protected] <[email protected]>
Sent: Thu Jun 10 08:31:18 2010
Subject: [jira] Commented: (MAPREDUCE-1849) Implement a FlumeJava-like library for operations over parallel collections using Hadoop MapReduce

[ https://issues.apache.org/jira/browse/MAPREDUCE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877451#action_12877451 ]

Jeff Hammerbacher commented on MAPREDUCE-1849:
----------------------------------------------

Owen: sure. They provide "derived operators" as well, like count(), join(), and top(). The main difference from Pig seems to be allowing users to work in Java. In fact, the Google team initially implemented their approach in a new language called Lumberjack, but mentions that, among other things, the implementation of a new language was a lot of work, and most importantly, novelty is an obstacle to adoption. They settled on Java and seem to have had some internal success.

> Implement a FlumeJava-like library for operations over parallel collections
> using Hadoop MapReduce
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1849
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1849
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Jeff Hammerbacher
>
> The API used internally at Google is described in great detail at
> http://portal.acm.org/citation.cfm?id=1806596.1806638.
