Or Cascading (+Groovy).

I should have a release of my Groovy Cascading builder out by this weekend...

def APACHE_COMMON_REGEX = /^([^ ]*) +[^ ]* +[^ ]* +\[([^]]*)\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$/
def APACHE_COMMON_GROUPS = [1, 2, 3, 4, 5, 6]
def APACHE_COMMON_FIELDS = ["ip", "time", "method", "url", "status", "size"]

def URL_PATTERN = /\/ongoing\/When\/\d\d\dx\/\d\d\d\d\/\d\d\/\d\d\/[^ .]+/
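If you want to sanity-check the log pattern outside of Cascading, here's a plain-Java sketch (the sample log line and class name are mine; the `]` inside the character class is escaped for Java's Pattern syntax):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegexDemo {
    // Same pattern as APACHE_COMMON_REGEX above, in Java string syntax
    static final Pattern APACHE_COMMON = Pattern.compile(
        "^([^ ]*) +[^ ]* +[^ ]* +\\[([^\\]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$");

    public static void main(String[] args) {
        // Hypothetical Apache common-log line, for illustration only
        String line = "127.0.0.1 - - [01/May/2008:14:12:00 -0700] "
                    + "\"GET /ongoing/When/200x/2008/05/01/Wide-Finder-2 HTTP/1.1\" 200 1234";
        Matcher m = APACHE_COMMON.matcher(line);
        if (m.matches()) {
            // groups 1..6 correspond to ip, time, method, url, status, size
            System.out.println(m.group(1) + " " + m.group(4) + " " + m.group(6));
            // prints: 127.0.0.1 /ongoing/When/200x/2008/05/01/Wide-Finder-2 1234
        }
    }
}
```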

def cascading = new Cascading()
def builder = cascading.builder()

Flow flow = builder.flow("widefinder")
  {
    source(input, scheme: text())

    // parse apache log
    regexParser(pattern: APACHE_COMMON_REGEX, groups: APACHE_COMMON_GROUPS, declared: APACHE_COMMON_FIELDS)

    // throw away tuples that don't match
    filter(arguments:["url"], pattern:URL_PATTERN)

    // throw away unused fields
    project(arguments:["url"])

    group(groupBy:["url"])

    // creates 'count' field, by default
    count()

    // group/sort on 'count', reverse the sort order
    group(["count"], reverse: true)

    sink(output, delete: true)
  }

flow.complete() // execute the flow
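For anyone who hasn't seen Cascading, the flow above is morally equivalent to this little in-memory Java sketch (sample URLs and all names are mine; the real flow of course streams the full 45GB through Hadoop):

```java
import java.util.*;
import java.util.regex.*;

public class WideFinderSketch {
    // Filter on URL_PATTERN, count hits per URL, sort by count descending --
    // the same filter/project/group/count/sort steps as the flow above.
    static final Pattern URL = Pattern.compile(
            "/ongoing/When/\\d\\d\\dx/\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+");

    static List<Map.Entry<String, Integer>> topUrls(List<String> urls) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String u : urls) {
            if (!URL.matcher(u).matches()) continue;   // filter step
            Integer n = counts.get(u);
            counts.put(u, n == null ? 1 : n + 1);      // group/count step
        }
        List<Map.Entry<String, Integer>> sorted =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();    // reverse sort on count
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "/ongoing/When/200x/2008/05/01/Wide-Finder-2",
                "/ongoing/When/200x/2008/05/01/Wide-Finder-2",
                "/ongoing/When/200x/2008/04/30/Other-Post",
                "/favicon.ico");                        // dropped by the filter
        for (Map.Entry<String, Integer> e : topUrls(sample))
            System.out.println(e.getValue() + "\t" + e.getKey());
        // prints:
        // 2	/ongoing/When/200x/2008/05/01/Wide-Finder-2
        // 1	/ongoing/When/200x/2008/04/30/Other-Post
    }
}
```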


On May 1, 2008, at 2:12 PM, Doug Cutting wrote:

Anyone want to play? The goal is to find a small program that quickly computes some statistics over 45GB of log data on a 32-core box. Hadoop seems like a good candidate. Streaming? Pig? Java?

http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2

Doug

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
