The "BristolHadoopWorkshopSpring2010" page has been changed by SteveLoughran.
The comment on this change is: Sanders' talk.

http://wiki.apache.org/hadoop/BristolHadoopWorkshopSpring2010?action=diff&rev1=3&rev2=4

--------------------------------------------------

This was a one-day event hosted by HP Laboratories, Bristol, and co-organised by HPLabs and Bristol University. It was a followup to the [[BristolHadoopWorkshop|2009 workshop]], again a meeting of locals to discuss what they were up to and look at Hadoop in physics, among other things.

== Julien Nioche: Behemoth ==
+
+ [[http://www.slideshare.net/steve_l/digital-pebble-behemoth | Slides]]

Julien Nioche at [[http://www.digitalpebble.com/|digitalPebble]] has been working on Natural Language Processing at scale.
 * Started with Apache UIMA: fairly simple
@@ -35, +37 @@

To make life complicated there is a lot of noise on the detectors, and timing problems can make data arrive out of order. You need to do a lot of filtering and look for signals a long way off random noise before you can declare that you've found something interesting.

Most physicists not only code as if they were writing FORTRAN, they never wrote good FORTRAN either. (This is a complaint by [[http://www.cs.utoronto.ca/~gvwilson/|Greg Wilson in Toronto]]: the computing departments never teach software engineering to all the scientists who are expected to code as part of their day to day science.)
-
+
HDFS has been used as a filestore in some of the US CMS Tier-2 sites; the new work that James discussed was that of actually treating physics problems as MapReduce jobs. They are bringing up a cluster of machines with storage for this, but would also like to use idle CPU time on other machines in the datacentre. There was some discussion on how to do this. MAPREDUCE-1603 is now a feature request asking for a way to make the assessing of availability a feature that supports plugins.
This would allow someone to write something that looked at the non-Hadoop workload of machines and reduced the number of Hadoop slots reported as available when the machines are busy with other work.

== Leo Simons: The BBC ==
@@ -45, +47 @@

 * The web page is integrated with iPlayer.
 * Friday afternoons are busy iPlayer times. People either skive off work or watch TV from their desk.
 * Lets you change your prefs: no need to log in, the preferences are just bound to cookies
- * Uses a hash of json to drive couchdb lookup, this lets them stay with 4M docs rather than 60M docs.
+ * Uses a hash of JSON to drive CouchDB lookup, this lets them stay with 4M docs rather than 60M docs.
 * They reach consistency in 40 ms or so; no need for microsecond consistency as the rate of change of the homepage is below that.
 * Compaction reduced the status display to "blue", rather than green, which had everyone panicking despite no visible change in behaviour. Moral: use light green instead.

- Lots of fun with incomplete resharding causing intermittent replication failures. When an app saw a 404, it created a new doc as it expected this and kept going, created extra load and resulted in a 7h replication.
+ Lots of fun with incomplete resharding causing intermittent replication failures. When an app saw a 404, it created a new doc as it expected this and kept going, created extra load and resulted in a 7h replication.
+ == Steve Loughran, HP: New Roles in the cloud ==
+
+ [[http://www.slideshare.net/steve_l/new-roles-for-the-cloud | Slides]]
+
+ Steve argued that with machine allocation/release being an API call away, you can avoid some of the problems of classic applications (needing large capital investment based on demand estimations), but there is a price: everything needs to be agile. There is no way to hard code hostnames into JSP, PHP or ASP pages; no way to offload High Availability problems to the hardware vendors.
Your architects need to think about how to include load measurement in their design, and how to make the application adapt to machines coming or going. Hadoop was cited as an example of an application designed to be un-agile: it does have hardcoded and cached hostnames in the configuration files; the workers' reaction to any NameNode or JobTracker failure is to spin waiting for it to come back, not to look up the hostnames in case they have moved. Similarly, the blacklisting process, while ideal for physical machines, is not the right way to deal with failures in virtual infrastructure, where the moment a machine starts playing up you ask for a new one.
+
+ The talk concluded with a demo of the CloudFarmer prototype UI, which is a simple front end on a model-driven infrastructure. In CloudFarmer, one person specifies the machine roles with disk image options, VM requirements, a list of (protocol, port, path) strings for URLs, and some other values. The web and RESTy interfaces then let callers create instances of each role; the URL lists are turned into absolute values for the web UI to work with.
+
+ Hadoop deployment with CloudFarmer was shown, and while HDFS came up, the JobTracker wasn't so happy. This led to a discussion on another problem in this world: debugging from log files in a world where the VMs can go away without much warning.
+
+ == Tim @last.fm: Hive ==
+
+ [[http://users.last.fm/~tims/20100310-Hive.pdf | Slides of Hive @ last.fm]]
+
+ * [[last.fm]] have been using Hive for 6 months
+ * Their cluster receives 600 events/second. On a par with Twitter right now, but Twitter "tweets" are growing faster and they have to do notifications
+ * Cluster: 44 nodes, each with 8 cores, 16 GB RAM and 4x1TB 7200 RPM storage (=704 GB RAM, 176 TB of storage, 352 cores).
+ * Charts: what's being played?
+ * Reporting: what they owe record companies?
+ * Corrections: cleaning up user supplied data. Most user data is pretty messy.
+ * Neighbours: finding similar users.
+ * Lots of queries about stuff: effectively a form of data warehousing.
+ * Hive tables can be one or more HDFS files;
+ * Hive also lets them import tables without bothering to pull into the HDFS filestore
+ * Some patches for various formats, Twitter did one for protocol buffers.
+ * No support for EchoIO; last.fm tried to do one and gave up, used Dumbo to import it into HDFS instead.
+ * External data is trusted more than anything else. Hive is not a database, just a query tool.
+ * Some queries can take minutes if they have to schedule MR jobs
+ Why Hive?
+ 1. Developers have some familiarity with SQL, especially the web team that live off Python. They make queries like how many people hit a page. Business people, like the advertising team, don't do MySQL; they bring the questions to the developers who use the console.
+ 1. Liked the ability to import from different sources
+ 1. It worked; at the time they looked, Pig didn't.
+ Example: Rage Against the Machine vs Joe from X-Factor.
+ * Query: how many users listen to music on the radio after they've scrobbled from their own collection?
+ * The number of users that scrobble is << the number of users that use the radio, but scrobbling users generate lots more data.
+ * Hive "explain" provides the execution plan.
+ Example: "reach": how many people have listened to an artist?
+ Example: "popularity": how often it is listened to
+ Workflow: scrobbles -> Hive -> Solr
+ === Weaknesses ===
+ * No recordio record support
+ * Big joins used to OOM, but this seems to have gone away. It used to join in RAM, which was fast but would OOM.
+ * Pig? Better for exploding data (cross products). Could be better for user functions
+ * After Last.fm upgraded to Cloudera 0.20 Hadoop, Hive would start jobs but not finish them. They didn't upgrade Hive at first. Recompiled Hive and eventually it went away. Cloudera's Hive release fixed this.
+ * The Thrift server for Hive stopped working as the Cloudera version pointed to a different lib which led to conflicts.
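
The "reach" and "popularity" examples from the talk can be illustrated with a minimal Python sketch. The record layout here, (user_id, artist) pairs with one pair per listen, and the function name are assumptions made for illustration; the notes do not give last.fm's actual Hive schema or queries. In SQL terms this would roughly be a COUNT(DISTINCT user) and a COUNT(*) grouped by artist.

```python
from collections import defaultdict


def reach_and_popularity(scrobbles):
    """Return {artist: (reach, popularity)}.

    scrobbles: iterable of (user_id, artist) pairs, one per listen.
    This layout is hypothetical, not last.fm's real schema.
    reach      = number of distinct users who listened to the artist
    popularity = total number of listens for the artist
    """
    listeners = defaultdict(set)  # artist -> set of distinct user ids
    plays = defaultdict(int)      # artist -> total listen count
    for user_id, artist in scrobbles:
        listeners[artist].add(user_id)
        plays[artist] += 1
    return {artist: (len(listeners[artist]), plays[artist]) for artist in plays}


# Made-up sample data: u1 listens to "Artist A" twice.
sample = [
    ("u1", "Artist A"),
    ("u2", "Artist A"),
    ("u1", "Artist A"),
    ("u3", "Artist B"),
]
stats = reach_and_popularity(sample)
```

With the sample above, "Artist A" has a reach of 2 and a popularity of 3, which shows why the two measures differ: repeat listens raise popularity but not reach.
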
+ Question: Has anyone tried to do any Object-Relational mappings? No.
+
+
+ == Sanders van der Waal: Community Engagement ==
+
+ [[http://www.slideshare.net/steve_l/community-engagement-3460225|Slides]]
+
+ Sanders van der Waal from [[http://www.oss-watch.ac.uk/|OSSWatch]] gave a talk on Community Engagement. OSSWatch provide consultation and some support on Open Source in UK Higher Education and Universities, and have been getting involved in Hadoop in the past year as it makes a good platform for some scientific research, as well as a place for CS people to explore scheduling problems.
+
+ Sanders emphasised that there is no Open Source community other than that which the users choose to make themselves; he also weighed the benefits of local groups (face to face discussion) against the risks (you are restricted in your contacts, and the discussions tend not to be archived or searchable in the way that mailing lists and bug tracker issues are).
+
+ There is a workshop in Oxford on June 24 & 25 on technology transfer, followed by a BarCamp on June 26; all are welcome.
+
+ Sanders' talk triggered an interesting discussion on whether the Grid model had delivered on what it had promised, or not. The answer: some stuff got addressed, but some things (storage) had been ignored, and turned out to be rather important.
+
