[ https://issues.apache.org/jira/browse/HADOOP-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573135#action_12573135 ]
Ted Dunning commented on HADOOP-2781:
-------------------------------------
Here is the README for grool, a standalone package I will upload shortly.
OVERVIEW
Grool is a simple extension to the Groovy scripting language. It is intended to make it easier to use Hadoop from a scripted environment and to make it possible to write simple map-reduce programs in a functional style with much less boiler-plate code than is typically required. Essentially, the goal is to make simple programs simple to write.

For instance, the venerable word-count example boils down to the following using grool:
    count = Hadoop.mr({key, text, out, reporter ->
        text.split(" ").each {
            out.collect(it, 1)
        }
    }, {word, counts, out, reporter ->
        int sum = 0
        counts.each { sum += it }
        out.collect(word, sum)
    })
When we call the function count, it will determine whether its input is local or already in the Hadoop file system and will invoke a map-reduce program. The location of the output is returned packaged in an object that can be passed to other map-reduce functions like count or read directly using familiar Groovy code constructs. For instance, to count some literal text and print the results, we can do this:
count(["this is some test", "data for a simple", "test of a
program"]).eachLine {
println(it)
}
As an example of composition of functions, we can write a simple variant of the word counting program that counts the prevalence of counts in each decade (1-10, 10-100 and so on). This can be written this way:
    decade = Hadoop.mr({key, text, out, reporter ->
        // log10 puts counts into decades (1-10, 10-100, ...); Math.log would bucket by powers of e
        out.collect(Math.floor(Math.log10(new Integer(text.split("\t")[1]))), 1)
    }, {decade, counts, out, reporter ->
        int sum = 0
        counts.each { sum += it }
        out.collect(decade, sum)
    })
These two programs can be composed and the result printed using familiar functional style:

    decade(count(text)).eachLine {
        println it
    }
When we are done, we can clean up any temporary files in the Hadoop file system using the following command:

    Hadoop.cleanup()
If we were to write code that needs line-by-line debugging, we can change any invocation of Hadoop.mr to Local.mr and the code will execute locally rather than using Hadoop. Local and Hadoop based map-reduce functions can be intermixed arbitrarily and transparently.
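For example, here is a minimal sketch of that swap, reusing the word-count closures from above (Local.mr is named here but its exact signature is an assumption; it is presumed to mirror Hadoop.mr):

    // Sketch only: assumes Local.mr accepts the same (mapper, reducer)
    // closure pair as Hadoop.mr, as the text above implies.
    localCount = Local.mr({key, text, out, reporter ->
        text.split(" ").each { out.collect(it, 1) }
    }, {word, counts, out, reporter ->
        int sum = 0
        counts.each { sum += it }
        out.collect(word, sum)
    })
    // Now breakpoints and println debugging work in an ordinary local JVM.
    localCount(["step through this", "in a debugger"]).eachLine { println it }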
KNOWN ISSUES and LIMITATIONS
As it stands, grool is useful for some simple tasks, but it badly needs extending. Some of the limitations include:

- all input is read using TextInputFormat (easily changed)
- combiners, partition functions and key sorting aren't supported yet (easily changed)
- there is essentially no documentation and very little in the way of test code. The current code is more of a proof of concept than a serious tool.
- the current API doesn't allow multiple input files beyond what a single glob expression can specify (the current argument parsing code is fugly and should be replaced)
- there is some speed penalty due to the type conversions performed at the Java/Groovy interface and because Groovy code can be slower than pure Java. Currently the penalty appears to be at most about 2-3x for applications like log parsing. In my experience, this is less than the cost of using, say, Pig. (Improving this may be very difficult to do without sacrificing the simplicity of writing grool code, but you can provide a Java mapper or reducer to avoid essentially all of this overhead; see the sketch after this list.)
- there are bound to be several horrid infelicities in the API as it stands that make grool really hard to use in important cases. This is bound to lead to significant and incompatible changes in the API. Chief among the suspect decisions is the fact that Hadoop.mr returns a closure instead of an object with interesting methods.
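To illustrate that escape hatch, here is a hedged sketch of a word-count mapper written against the org.apache.hadoop.mapred API of this era. How such a class is actually handed to Hadoop.mr isn't shown in this README, so the wiring is omitted:

    // Sketch: a mapper written as a plain class avoids per-record closure
    // dispatch and Java/Groovy type conversion.
    import org.apache.hadoop.io.IntWritable
    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.MapReduceBase
    import org.apache.hadoop.mapred.Mapper
    import org.apache.hadoop.mapred.OutputCollector
    import org.apache.hadoop.mapred.Reporter

    class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1)
        private final Text word = new Text()

        void map(LongWritable key, Text value,
                 OutputCollector<Text, IntWritable> out, Reporter reporter) {
            value.toString().split(" ").each {
                word.set(it)
                out.collect(word, ONE)
            }
        }
    }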
HOW IT WORKS
The fundamental difficulty in writing something like grool is that it is hard to write a script that executes code remotely without expressing the remote code as a string. If you do that, then you lose all of the syntax checking that the language provides.

I wanted to use a language like Python or Groovy, but I wasn't willing to give up being able to call the map function easily for testing before composing it into a map-reduce function. Unfortunately, languages like Groovy and Python don't provide access to the source of functions or closures and, at least in Groovy's case, these functions may refer to variables outside of the scope of the function, which would mean that the text of the function wouldn't make sense on the remote node anyway.
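Because mappers are plain Groovy closures, this kind of isolated testing is straightforward. A small sketch, in which the collecting output class is invented for the example:

    // Sketch: a stand-in collector, invented here, for testing a mapper
    // closure in isolation.
    class CollectingOutput {
        List pairs = []
        void collect(k, v) { pairs << [k, v] }
    }

    mapper = {key, text, out, reporter ->
        text.split(" ").each { out.collect(it, 1) }
    }
    out = new CollectingOutput()
    mapper(null, "a b a", out, null)
    assert out.pairs == [["a", 1], ["b", 1], ["a", 1]]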
The solution used by grool to get around this difficulty is to execute the ENTIRE script multiple times on different machines. Depending on where the script is executing, the semantics of the functions being executed differ. When the script is executed initially on the local machine, it primarily copies data to and from the Hadoop file system, decides what the names of temporary files should be, configures Hadoop jobs and invokes them. When executed by a mapper's configure method, the script executes all initialization code in the script and then saves references to all of the mapper functions defined in the script so that the map method can invoke the correct function. Similar actions occur in the configure method of each reducer.
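A hypothetical illustration of the scheme (the class and names below are invented for the sketch and are not grool's actual internals):

    // Sketch only: suppose that when the script re-runs inside a task,
    // Hadoop.mr records each mapper closure in a registry instead of
    // launching a job; configure() re-runs the whole script and then picks
    // out the closure this particular task should execute.
    import groovy.lang.GroovyShell

    class MapperHostSketch {
        static Map<String, Closure> registeredMappers = [:]  // filled in while the script re-runs
        Closure mapFn

        void configure(String scriptSource, String mapperId) {
            new GroovyShell().evaluate(scriptSource)  // executes all top-level init code
            mapFn = registeredMappers[mapperId]       // dispatch target for map()
        }
    }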
This can be a bit confusing because references to external variables don't really refer to the same variables. In addition, some code only makes sense to execute in the correct environment. Mostly, this involves code such as writing results, which is handled by a small hack where all map-reduce functions return a reference to a null file when executed by mappers or reducers. That makes any output loops, such as those in the simple examples above, execute only a single time. Some other code may require more care. To help with those cases, there is a function Hadoop.local that accepts a closure to be executed only on the original machine.
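For instance (a minimal sketch, using Hadoop.local as just described):

    // Runs only during the original local execution of the script, not when
    // the script is re-executed inside mapper or reducer tasks.
    Hadoop.local {
        println "runs once, on the submitting machine"
    }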
> Hadoop/Groovy integration
> -------------------------
>
> Key: HADOOP-2781
> URL: https://issues.apache.org/jira/browse/HADOOP-2781
> Project: Hadoop Core
> Issue Type: New Feature
> Environment: Any
> Reporter: Ted Dunning
> Fix For: 0.17.0
>
>
> This is a place-holder issue to hold initial release of the groovy
> integration for hadoop.
> The goal is to be able to write very simple map-reduce programs in just a few
> lines of code in a functional style. Word count should be less than 5 lines
> of code!