[ https://issues.apache.org/jira/browse/HADOOP-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573135#action_12573135 ]
Ted Dunning commented on HADOOP-2781:
-------------------------------------
Here is the README for grool, a standalone package I will upload shortly.
OVERVIEW
Grool is a simple extension to the Groovy scripting language. It is intended to make it easier to use Hadoop from a scripted environment and to make it possible to write simple map-reduce programs in a functional style with much less boiler-plate code than is typically required. Essentially, the goal is to make simple programs simple to write.

For instance, the venerable word-count example boils down to the following using grool:
    count = Hadoop.mr({key, text, out, reporter ->
        text.split(" ").each {
            out.collect(it, 1)
        }
    }, {word, counts, out, reporter ->
        int sum = 0
        counts.each { sum += it }
        out.collect(word, sum)
    })
When we call the function count, it will determine whether its input is local or already in the Hadoop file system and will invoke a map-reduce program. The location of the output is returned packaged in an object that can be passed to other map-reduce functions like count or read directly using familiar Groovy code constructs. For instance, to count some literal text and print the results, we can do this:
count(["this is some test", "data for a simple", "test of a
program"]).eachLine {
println(it)
}
As an example of composition of functions, we can write a simple variant of the word counting program that counts the prevalence of counts in each decade (1-10, 10-100 and so on). This can be written this way:
    decade = Hadoop.mr({key, text, out, reporter ->
        // log10 puts counts into decades (1-10, 10-100, ...); Math.log would bucket by powers of e
        out.collect(Math.floor(Math.log10(new Integer(text.split("\t")[1]))), 1)
    }, {decade, counts, out, reporter ->
        int sum = 0
        counts.each { sum += it }
        out.collect(decade, sum)
    })
These two programs can be composed and the result printed using familiar functional style:

    decade(count(text)).eachLine {
        println it
    }
When we are done, we can clean up any temporary files in the Hadoop file system using the following command:

    Hadoop.cleanup()
If we were to write code that needs line-by-line debugging, we can change any invocation of Hadoop.mr to Local.mr and the code will execute locally rather than using Hadoop. Local and Hadoop based map-reduce functions can be intermixed arbitrarily and transparently.
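For example, here is a minimal sketch of that swap, reusing the word-count closures from above (Local.mr is named here but its exact signature is an assumption; it is presumed to mirror Hadoop.mr):

    // Sketch only: assumes Local.mr accepts the same (mapper, reducer)
    // closure pair as Hadoop.mr, as the text above implies.
    localCount = Local.mr({key, text, out, reporter ->
        text.split(" ").each { out.collect(it, 1) }
    }, {word, counts, out, reporter ->
        int sum = 0
        counts.each { sum += it }
        out.collect(word, sum)
    })
    // Now breakpoints and println debugging work in an ordinary local JVM.
    localCount(["step through this", "in a debugger"]).eachLine { println it }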
KNOWN ISSUES and LIMITATIONS
As it stands, grool is useful for some simple tasks, but it badly needs extending. Some of the limitations include:

- all input is read using TextInputFormat (easily changed)
- combiners, partition functions and key sorting aren't supported yet (easily changed)
- there is essentially no documentation and very little in the way of test code. The current code is more of a proof of concept than a serious tool.
- the current API doesn't allow multiple input files beyond what a single glob expression can specify (the current argument parsing code is fugly and should be replaced)
- there is some speed penalty due to the type conversions performed at the Java/Groovy interface and because Groovy code can be slower than pure Java. Currently the penalty appears to be at most about 2-3x for applications like log parsing. In my experience, this is less than the cost of using, say, Pig. (Improving this may be very difficult to do without sacrificing the simplicity of writing grool code, but you can provide a Java mapper or reducer to avoid essentially all of this overhead; see the sketch after this list.)
- there are bound to be several horrid infelicities in the API as it stands that make grool really hard to use in important cases. This is bound to lead to significant and incompatible changes in the API. Chief among the suspect decisions is the fact that Hadoop.mr returns a closure instead of an object with interesting methods.
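To illustrate that escape hatch, here is a hedged sketch of a word-count mapper written against the org.apache.hadoop.mapred API of this era. How such a class is actually handed to Hadoop.mr isn't shown in this README, so the wiring is omitted:

    // Sketch: a mapper written as a plain class avoids per-record closure
    // dispatch and Java/Groovy type conversion.
    import org.apache.hadoop.io.IntWritable
    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.MapReduceBase
    import org.apache.hadoop.mapred.Mapper
    import org.apache.hadoop.mapred.OutputCollector
    import org.apache.hadoop.mapred.Reporter

    class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1)
        private final Text word = new Text()

        void map(LongWritable key, Text value,
                 OutputCollector<Text, IntWritable> out, Reporter reporter) {
            value.toString().split(" ").each {
                word.set(it)
                out.collect(word, ONE)
            }
        }
    }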
HOW IT WORKS
The fundamental difficulty in writing something like grool is that it is hard to write a script that executes code remotely without expressing the remote code as a string. If you do that, then you lose all of the syntax checking that the language provides.

I wanted to use a language like Python or Groovy, but I wasn't willing to give up being able to call the map function easily for testing before composing it into a map-reduce function. Unfortunately, languages like Groovy and Python don't provide access to the source of functions or closures and, at least in Groovy's case, these functions may refer to variables outside of the scope of the function, which would mean that the text of the function wouldn't make sense on the remote node anyway.
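Because mappers are plain Groovy closures, this kind of isolated testing is straightforward. A small sketch, in which the collecting output class is invented for the example:

    // Sketch: a stand-in collector, invented here, for testing a mapper
    // closure in isolation.
    class CollectingOutput {
        List pairs = []
        void collect(k, v) { pairs << [k, v] }
    }

    mapper = {key, text, out, reporter ->
        text.split(" ").each { out.collect(it, 1) }
    }
    out = new CollectingOutput()
    mapper(null, "a b a", out, null)
    assert out.pairs == [["a", 1], ["b", 1], ["a", 1]]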
The solution used by grool to get around this difficulty is to execute the ENTIRE script multiple times on different machines. Depending on where the script is executing, the semantics of the functions being executed differ. When the script is executed initially on the local machine, it primarily copies data to and from the Hadoop file system, decides what the names of temporary files should be, configures Hadoop jobs and invokes them. When executed by a mapper's configure method, the script executes all initialization code in the script and then saves references to all of the mapper functions defined in the script so that the map method can invoke the correct function. Similar actions occur in the configure method of each reducer.
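A hypothetical illustration of the scheme (the class and names below are invented for the sketch and are not grool's actual internals):

    // Sketch only: suppose that when the script re-runs inside a task,
    // Hadoop.mr records each mapper closure in a registry instead of
    // launching a job; configure() re-runs the whole script and then picks
    // out the closure this particular task should execute.
    import groovy.lang.GroovyShell

    class MapperHostSketch {
        static Map<String, Closure> registeredMappers = [:]  // filled in while the script re-runs
        Closure mapFn

        void configure(String scriptSource, String mapperId) {
            new GroovyShell().evaluate(scriptSource)  // executes all top-level init code
            mapFn = registeredMappers[mapperId]       // dispatch target for map()
        }
    }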
This can be a bit confusing because references to external variables don't really refer to the same variables. In addition, some code only makes sense to execute in the correct environment. Mostly, this involves code such as writing results, which is handled by a small hack where all map-reduce functions return a reference to a null file when executed by mappers or reducers. That makes any output loops, such as those in the simple examples above, execute only a single time. Some other code may require more care. To help with those cases, there is a function Hadoop.local that accepts a closure to be executed only on the original machine.
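For instance (a minimal sketch, using Hadoop.local as just described):

    // Runs only during the original local execution of the script, not when
    // the script is re-executed inside mapper or reducer tasks.
    Hadoop.local {
        println "runs once, on the submitting machine"
    }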
> Hadoop/Groovy integration
> -------------------------
>
> Key: HADOOP-2781
> URL: https://issues.apache.org/jira/browse/HADOOP-2781
> Project: Hadoop Core
> Issue Type: New Feature
> Environment: Any
> Reporter: Ted Dunning
> Fix For: 0.17.0
>
>
> This is a place-holder issue to hold initial release of the groovy
> integration for hadoop.
> The goal is to be able to write very simple map-reduce programs in just a few
> lines of code in a functional style. Word count should be less than 5 lines
> of code!