The "BristolHadoopWorkshopSpring2010" page has been changed by SteveLoughran.
The comment on this change is: starting workshop notes.
http://wiki.apache.org/hadoop/BristolHadoopWorkshopSpring2010

--------------------------------------------------

New page:
= Bristol Hadoop Workshop Spring 2010 =

This was a one-day event hosted by HP Laboratories, Bristol, and co-organised 
by HPLabs and Bristol University. It was a follow-up to the 
[[BristolHadoopWorkshop|2009 workshop]], again a meeting of locals to discuss 
what they were up to and look at Hadoop in physics, among other things.

== Julien Nioche: Behemoth ==

Julien Nioche at [[http://www.digitalpebble.com/|digitalPebble]] has been 
working on Natural Language Processing at scale.
 * Started with Apache UIMA: fairly simple
 * Now working on Behemoth, "Hadoop's evil twin": not a nice elephant at all
The goal is large-scale document analysis based on Hadoop: Behemoth lets you 
deploy GATE or UIMA applications on Hadoop clusters. It was driven by the need 
to implement this for more than one client; the code was opened up to avoid 
writing it from scratch every time.

Workflow: load the documents into HDFS, then import them into the Behemoth 
document format. Inputs include PDF, HTML, WARC and Nutch segments; Apache 
Tika is used to extract text and metadata. The output is a set of key/value 
pairs (key=URI, value=BehemothDocument).
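
The workflow above can be sketched in a few lines of plain Python. This is 
only an illustration of the (URI, document) data flow, not Behemoth's actual 
API: the `Document` and `Annotation` classes, their field names, and the 
`import_documents` and `annotate_tokens` functions are all hypothetical, and 
the Tika extraction step is replaced by a trivial whitespace strip.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical shape)."""
    type: str
    start: int
    end: int


@dataclass
class Document:
    """Stand-in for a Behemoth document; the field names are assumptions."""
    uri: str
    content_type: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)
    annotations: List[Annotation] = field(default_factory=list)


def import_documents(
    raw: Iterable[Tuple[str, str, str]]
) -> Iterator[Tuple[str, Document]]:
    """Import step: emit (key=URI, value=Document) pairs.

    A real import would extract text and metadata with Apache Tika;
    here the "extraction" is just stripping surrounding whitespace.
    """
    for uri, content_type, body in raw:
        text = body.strip()
        yield uri, Document(uri, content_type, text, {"length": str(len(text))})


def annotate_tokens(
    pairs: Iterable[Tuple[str, Document]]
) -> Iterator[Tuple[str, Document]]:
    """Map step: add one Token annotation per whitespace-separated word.

    Stands in for a GATE or UIMA application run over each document.
    """
    for uri, doc in pairs:
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            end = start + len(word)
            doc.annotations.append(Annotation("Token", start, end))
            pos = end
        yield uri, doc
```

Chaining the two steps, `annotate_tokens(import_documents(raw))`, mirrors the 
import-then-annotate order described above; on a cluster each step would be a 
Hadoop job over the stored pairs rather than an in-memory generator.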

Features
 * Common ground between UIMA and GATE (University of Sheffield, open source)
 * Supports different (non-Java) annotators
 * Easy to configure using the Hadoop config file format and Behemoth/UIMA 
rules in JARs
 * Works with the Hadoop ecosystem

Demo: shows that the jobtracker JSP file has been extended with GATE metrics.

Future work: Cascading support, Avro for cross-language code, Solr and 
Mahout. It still needs to be tested at scale: it has been run on 200K 
documents so far. Julien would be interested in hearing from anyone with a 
datacentre and an NLP problem.
