Or Cascading (+Groovy).
I should have a release of my Groovy Cascading builder out by this weekend...
def APACHE_COMMON_REGEX = /^([^ ]*) +[^ ]* +[^ ]* +\[([^]]*)\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$/
def APACHE_COMMON_GROUPS = [1, 2, 3, 4, 5, 6]
def APACHE_COMMON_FIELDS = ["ip", "time", "method", "url", "status", "size"]
def URL_PATTERN = /\/ongoing\/When\/\d\d\dx\/\d\d\d\d\/\d\d\/\d\d\/[^ .]+/

def cascading = new Cascading()
def builder = cascading.builder()

Flow flow = builder.flow("widefinder")
  {
  source(input, scheme: text())

  // parse the apache log into the declared fields
  regexParser(pattern: APACHE_COMMON_REGEX, groups: APACHE_COMMON_GROUPS, declared: APACHE_COMMON_FIELDS)

  // throw away tuples that don't match
  filter(arguments: ["url"], pattern: URL_PATTERN)

  // throw away unused fields
  project(arguments: ["url"])

  group(groupBy: ["url"])

  // creates a 'count' field, by default
  count()

  // group/sort on 'count', reverse the sort order
  group(["count"], reverse: true)

  sink(output, delete: true)
  }

flow.complete() // execute the flow
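For anyone who wants to see what the flow computes without running Cascading or Hadoop, here is a single-process Python sketch of the same pipeline. The regex, field names, and URL pattern are taken from the flow above; the log lines are fabricated sample records, not real Wide Finder data.

```python
import re
from collections import Counter

# Same pattern and field names as the Cascading flow above.
APACHE_COMMON_REGEX = (
    r'^([^ ]*) +[^ ]* +[^ ]* +\[([^]]*)\] +'
    r'"([^ ]*) ([^ ]*) [^ ]*" ([^ ]*) ([^ ]*).*$'
)
APACHE_COMMON_FIELDS = ["ip", "time", "method", "url", "status", "size"]
URL_PATTERN = r'/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+'

# Fabricated sample lines in Apache common log format.
lines = [
    '127.0.0.1 - - [01/May/2008:14:12:00 -0700] '
    '"GET /ongoing/When/200x/2008/05/01/Wide-Finder-2 HTTP/1.1" 200 1234',
    '10.0.0.2 - - [01/May/2008:14:13:00 -0700] '
    '"GET /ongoing/When/200x/2008/05/01/Wide-Finder-2 HTTP/1.1" 200 1234',
    '10.0.0.3 - - [01/May/2008:14:14:00 -0700] '
    '"GET /favicon.ico HTTP/1.1" 200 99',
]

counts = Counter()
for line in lines:
    m = re.match(APACHE_COMMON_REGEX, line)          # regexParser
    record = dict(zip(APACHE_COMMON_FIELDS, m.groups()))
    if re.match(URL_PATTERN, record["url"]):         # filter on 'url'
        counts[record["url"]] += 1                   # project + group + count

# group/sort on 'count', reverse the sort order
results = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(results)
```

The favicon request is dropped by the URL filter, leaving one article URL with a count of 2. In the real flow, the group and count steps run as Hadoop map/reduce phases over the 45GB input rather than in a single loop.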
On May 1, 2008, at 2:12 PM, Doug Cutting wrote:
Anyone want to play? The goal is to find a small program that
quickly computes some statistics over 45GB of log data on a 32-core
box. Hadoop seems like a good candidate. Streaming? Pig? Java?
http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2
Doug
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/