Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "BristolHadoopWorkshopSpring2010" page has been changed by SteveLoughran.
The comment on this change is: HEP.
http://wiki.apache.org/hadoop/BristolHadoopWorkshopSpring2010?action=diff&rev1=1&rev2=2

--------------------------------------------------

   * Easy to configure using the Hadoop config file format and Behemoth/UIMA rules in JARs
   * Works on Hadoop the ecosystem
  
- Demo: shows that the jobtracker JSP file has been extended with GATE metrics.
+ Demo: shows that the JobTracker JSP page had been extended with GATE metrics.
  
  Future work: Cascading support and Avro for cross-language code, SOLR and Mahout. It needs to be tested at scale. It has been run at 200K documents so far; Julien would be interested in anyone with a datacentre and an NLP problem.
  
+ == James Jackson: Hadoop and High Energy Physics ==
+ 
+ James is from CERN and the CMS experiment; he spoke about ongoing work exploring the use of Hadoop for HEP event mining.
+ 
+ The LHC experiments (ATLAS, CMS, etc.) generate event data, most of which is uninteresting. Physics events can be split into:
+  * Uninteresting and known physics
+  * Unknown and uninteresting. We don't have the theory ready for these events yet
+  * Unknown and interesting: stuff people are looking for that matches (somewhat) the current theories, gives you Nobel prizes and the like.
+ 
+ To make life complicated, there is a lot of noise on the detectors, and timing problems can cause data to arrive out of order. You need to do a lot of filtering and look for signals a long way off random noise before you can declare that you've found something interesting.
+ 
+ Most physicists not only code as if they were writing FORTRAN, they never wrote good FORTRAN either. (This is a complaint by [[http://www.cs.utoronto.ca/~gvwilson/|Greg Wilson in Toronto]]: the computing departments never teach software engineering to the scientists who are expected to code as part of their day-to-day science.)
+ 
+ HDFS has been used as a filestore at some of the US CMS Tier-2 sites; the new work that James discussed treats physics problems as MapReduce jobs. They are bringing up a cluster of machines with storage for this, but would also like to use idle CPU time on other machines in the datacentre. There was some discussion on how to do this: MAPREDUCE-1603 is now a feature request asking for the assessment of slot availability to be made pluggable. This would allow someone to write a plugin that looked at the non-Hadoop workload of a machine and reduced the number of Hadoop slots reported as available when it is busy with other work.
+ 
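
The slot-reduction idea behind MAPREDUCE-1603 can be illustrated with a small sketch. This is not the Hadoop API and not the design proposed in that issue: the function name `available_slots`, its parameters, and the linear scaling rule are all hypothetical, chosen only to show how a plugin might shrink the advertised slot count as non-Hadoop load rises.

```python
import math


def available_slots(configured_slots: int, cpu_cores: int, external_load: float) -> int:
    """Hypothetical policy: scale the advertised slot count by the idle CPU share.

    configured_slots -- slots the node would report when otherwise idle
    cpu_cores        -- number of CPU cores on the machine
    external_load    -- average number of cores kept busy by non-Hadoop work
                        (an assumed input; how it is measured is out of scope)
    """
    if configured_slots < 0 or cpu_cores <= 0:
        raise ValueError("configured_slots must be >= 0 and cpu_cores must be > 0")

    # Fraction of the machine used by other work, clamped to the range [0, 1].
    busy_fraction = min(max(external_load / cpu_cores, 0.0), 1.0)

    # Round down so the node never advertises more than the idle share allows.
    return math.floor(configured_slots * (1.0 - busy_fraction))


if __name__ == "__main__":
    # An 8-core node configured for 8 slots, with 4 cores busy elsewhere.
    print(available_slots(8, 8, 4.0))
```

A linear rule is only one possible choice; a real plugin might instead use thresholds or hysteresis so that the reported slot count does not flap as load fluctuates.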
