Or Cascading (+Groovy).

I should have a release of my Groovy Cascading builder out by this weekend...

def APACHE_COMMON_REGEX = /^([^ ]*) +[^ ]* +[^ ]* +\[([^]]*)\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$/
def APACHE_COMMON_GROUPS = [1, 2, 3, 4, 5, 6]
def APACHE_COMMON_FIELDS = ["ip", "time", "method", "url", "status", "size"]

def URL_PATTERN = /\/ongoing\/When\/\d\d\dx\/\d\d\d\d\/\d\d\/\d\d\/[^ .]+/
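If you want to sanity-check the log pattern outside of Cascading, here's a plain-Java sketch (the sample log line and class name are mine; the `]` inside the character class is escaped for Java's Pattern syntax):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegexDemo {
    // Same pattern as APACHE_COMMON_REGEX above, in Java string syntax
    static final Pattern APACHE_COMMON = Pattern.compile(
        "^([^ ]*) +[^ ]* +[^ ]* +\\[([^\\]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$");

    public static void main(String[] args) {
        // Hypothetical Apache common-log line, for illustration only
        String line = "127.0.0.1 - - [01/May/2008:14:12:00 -0700] "
                    + "\"GET /ongoing/When/200x/2008/05/01/Wide-Finder-2 HTTP/1.1\" 200 1234";
        Matcher m = APACHE_COMMON.matcher(line);
        if (m.matches()) {
            // groups 1..6 correspond to ip, time, method, url, status, size
            System.out.println(m.group(1) + " " + m.group(4) + " " + m.group(6));
            // prints: 127.0.0.1 /ongoing/When/200x/2008/05/01/Wide-Finder-2 1234
        }
    }
}
```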

def cascading = new Cascading()
def builder = cascading.builder()

Flow flow = builder.flow("widefinder")
  {
    source(input, scheme: text())

    // parse apache log
    regexParser(pattern: APACHE_COMMON_REGEX, groups: APACHE_COMMON_GROUPS, declared: APACHE_COMMON_FIELDS)

    // throw away tuples that don't match
    filter(arguments:["url"], pattern:URL_PATTERN)

    // throw away unused fields
    project(arguments:["url"])

    group(groupBy:["url"])

    // creates 'count' field, by default
    count()

    // group/sort on 'count', reverse the sort order
    group(["count"], reverse: true)

    sink(output, delete: true)
  }

flow.complete() // execute the flow
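For anyone who hasn't seen Cascading, the flow above is morally equivalent to this little in-memory Java sketch (sample URLs and all names are mine; the real flow of course streams the full 45GB through Hadoop):

```java
import java.util.*;
import java.util.regex.*;

public class WideFinderSketch {
    // Filter on URL_PATTERN, count hits per URL, sort by count descending --
    // the same filter/project/group/count/sort steps as the flow above.
    static final Pattern URL = Pattern.compile(
            "/ongoing/When/\\d\\d\\dx/\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+");

    static List<Map.Entry<String, Integer>> topUrls(List<String> urls) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String u : urls) {
            if (!URL.matcher(u).matches()) continue;   // filter step
            Integer n = counts.get(u);
            counts.put(u, n == null ? 1 : n + 1);      // group/count step
        }
        List<Map.Entry<String, Integer>> sorted =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();    // reverse sort on count
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "/ongoing/When/200x/2008/05/01/Wide-Finder-2",
                "/ongoing/When/200x/2008/05/01/Wide-Finder-2",
                "/ongoing/When/200x/2008/04/30/Other-Post",
                "/favicon.ico");                        // dropped by the filter
        for (Map.Entry<String, Integer> e : topUrls(sample))
            System.out.println(e.getValue() + "\t" + e.getKey());
        // prints:
        // 2	/ongoing/When/200x/2008/05/01/Wide-Finder-2
        // 1	/ongoing/When/200x/2008/04/30/Other-Post
    }
}
```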


On May 1, 2008, at 2:12 PM, Doug Cutting wrote:

Anyone want to play? The goal is to find a small program that quickly computes some statistics over 45GB of log data on a 32-core box. Hadoop seems like a good candidate. Streaming? Pig? Java?

http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2

Doug

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
