The "BristolHadoopWorkshopSpring2010" page has been changed by SteveLoughran.
The comment on this change is: starting workshop notes.

http://wiki.apache.org/hadoop/BristolHadoopWorkshopSpring2010

--------------------------------------------------

New page:
= Bristol Hadoop Workshop Spring 2010 =

This was a one-day event hosted by HP Laboratories, Bristol, and co-organised by HPLabs and Bristol University. It was a follow-up to the [[BristolHadoopWorkshop|2009 workshop]]: again a meeting of locals to discuss what they were up to, and to look at Hadoop in physics, among other things.

== Julien Nioche: Behemoth ==

Julien Nioche at [[http://www.digitalpebble.com/|digitalPebble]] has been working on Natural Language Processing at scale.

 * Started with Apache UIMA: fairly simple
 * Now working on Behemoth, "Hadoop's evil twin": not a nice elephant at all

The goal is large-scale document analysis based on Hadoop: to let you deploy GATE or UIMA applications on Hadoop clusters. It was driven by the need to implement this for more than one client; it was opened up to avoid writing it from scratch every time.

Workflow: load the documents into HDFS, then import them into the Behemoth document format. Supported inputs include PDF, HTML, WARC, Nutch segments, etc.; Apache Tika is used to extract text and metadata. Output: (key=URI, value=BehemothDocument).

Features:
 * Common ground between UIMA and GATE (Sheffield University, closed source)
 * Supports different (non-Java) annotators
 * Easy to configure using the Hadoop config file format and Behemoth/UIMA rules in JARs
 * Works on the Hadoop ecosystem

Demo: shows that the jobtracker JSP file has been extended with GATE metrics.

Future work: cascading support and Avro for cross-language code, SOLR and Mahout. It needs to be tested at scale. It has been run on 200K documents so far; Julien would be interested in anyone with a datacentre and an NLP problem.
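
To make the Behemoth workflow above more concrete, here is a minimal, stdlib-only Python sketch of the import step: one raw record goes in, one (key=URI, value=document) pair comes out. It is not Behemoth code. The `SketchDocument` class, its fields, and the `import_record` function are hypothetical names invented for illustration, and the small HTML parser is only a stand-in for the text extraction that the notes attribute to Apache Tika.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; a stdlib stand-in for the Tika extraction step."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text fragments, with surrounding whitespace removed.
        if data.strip():
            self.chunks.append(data.strip())


@dataclass
class SketchDocument:
    # Hypothetical fields: the real BehemothDocument layout is not described in the notes.
    url: str
    content_type: str
    text: str
    metadata: dict = field(default_factory=dict)


def import_record(url, content_type, raw):
    """Map-style step: one raw input record in, one (key, value) pair out."""
    if content_type == "text/html":
        parser = _TextExtractor()
        parser.feed(raw)
        parser.close()
        text = " ".join(parser.chunks)
    else:
        # Assumption: anything that is not HTML is treated as plain text here.
        text = raw
    doc = SketchDocument(url=url, content_type=content_type, text=text,
                         metadata={"length": len(text)})
    return url, doc


# Two made-up input records standing in for files already loaded into HDFS.
records = [
    ("http://example.org/a.html", "text/html",
     "<html><body><p>Hello <b>Hadoop</b></p></body></html>"),
    ("http://example.org/b.txt", "text/plain", "plain text body"),
]

# The corpus is keyed by URI, mirroring the (key=URI, value=document) output.
corpus = dict(import_record(*record) for record in records)
```

In a real Hadoop job the same shape would appear as a mapper emitting the URI as the key and a serialised document as the value, so that later annotation steps can be chained over the same pairs.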
