[Hadoop Wiki] Update of "BristolHadoopWorkshopSpring2010 " by SteveLoughran

Apache Wiki Mon, 22 Mar 2010 07:51:26 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "BristolHadoopWorkshopSpring2010" page has been changed by SteveLoughran.
The comment on this change is: Sanders' talk.
http://wiki.apache.org/hadoop/BristolHadoopWorkshopSpring2010?action=diff&rev1=3&rev2=4

--------------------------------------------------

  This was a one-day event hosted by HP Laboratories, Bristol, and co-organised 
by HPLabs and Bristol University. It was a followup to the 
[[BristolHadoopWorkshop|2009 workshop]], again a meeting of locals to discuss 
what they were up to and look at Hadoop in physics, among other things.
  
  == Julien Nioche: Behemoth ==
+ 
+ [[http://www.slideshare.net/steve_l/digital-pebble-behemoth | Slides]]
  
  Julien Nioche at [[http://www.digitalpebble.com/|digitalPebble]] has been 
working on Natural Language Processing at scale.
   * Started with Apache UIMA: fairly simple
@@ -35, +37 @@

  To make life complicated there is a lot of noise on the detectors, timing 
problems can have stuff come in out of order. You need to do a lot of filtering 
and look for signals a long way off random noise before you can declare that 
you've found something interesting.
  
  Most physicists not only code as if they were writing FORTRAN, they never 
wrote good FORTRAN either. (this is a complaint by 
[[http://www.cs.utoronto.ca/~gvwilson/|Greg Wilson in Toronto]] - the computing 
departments never teach software engineering to all the scientists who are 
expected to code as part of their day to day science).
-  
+ 
  HDFS has been used as a filestore in some of the US CMS Tier-2 sites, the new 
work that James discussed was that of actually treating physics problems as 
MapReduce jobs. They are bringing up a cluster of machines with storage for 
this, but would also like to use idle CPU time on other machines in the 
datacentre -there was some discussion on how to do this MAPREDUCE-1603 is now a 
feature request asking for a way to make the assessing of availability a 
feature that supported plugins. This would allow someone to write something 
that looked at non-Hadoop workload of machines and reduced the number Hadoop 
slots to report as being available when busy with other work.
  
  == Leo Simons: The BBC  ==
@@ -45, +47 @@

   * The web page is integrated with iplayer.
   * Friday afternoons are busy iPlayer times. People either skive off work or 
watch TV from their desk.
   * Lets you change your prefs -no need to login, the preferences are just 
bound to cookies
-  * Uses a hash of json to drive couchdb lookup, this lets them stay with 4M 
docs rather than 60M docs.
+  * Uses a hash of JSON to drive CouchDB lookup, this lets them stay with 4M 
docs rather than 60M docs.
   * They reach consistency in 40mS or so, no need for microsecond consistency 
as the rate of change of  homepage is below that.
   * Compaction reduced the status display to "blue", rather than green, had 
everyone panicing but no visible change in behaviour. Moral: use light green 
instead.
- Lots of fun with incomplete resharding causing intermittent replication 
failures. When an app saw a 404, it created a new doc as it expected this and 
kept going, created extra load and resulted in a 7h replication. 
+ Lots of fun with incomplete resharding causing intermittent replication 
failures. When an app saw a 404, it created a new doc as it expected this and 
kept going, created extra load and resulted in a 7h replication.
  
+ == Steve Loughran, HP: New Roles in the cloud ==
+ 
+ [[http://www.slideshare.net/steve_l/new-roles-for-the-cloud | Slides]]
+ 
+ Steve argued that with machine allocation/release being an API call away, you 
can avoid some of the problems of classic applications (needing large capital 
investment based on demand estimations), but there is a price: everything needs 
to be agile. There is no way to hard code hostnames into JSP, PHP or ASP pages; 
no way to offload High Availability problems to the hardware vendors. Your 
architects need to think about how to include load measurement in their design, 
how to make the application adapt to machines coming or going. Hadoop was cited 
as an example of an application designed to be un-agile: it does have hardcoded 
and cached hostnames in the configuration files; the workers' reaction to any 
NameNode or JobTracker failure is to spin waiting for it to come back, not to 
look up the hostnames in case they have moved. Similarly, the blacklisting 
process, while ideal for physical machines, is not the right way to deal with 
failures in virtual infrastructure, where the moment a machine starts playing 
up you ask for a new one.
+ 
+ The talk concluded with a demo of the CloudFarmer prototype UI, which is a 
simple front end on a model-driven infrastructure. In CloudFarmer, one person 
specifies the machine roles with disk image options, VM requirements, a list of 
(protocol, port, path) strings for URLS, and some other values. The web and 
RESTy interfaces then let callers create instances of each role; the URL lists 
are turned into absolute values for the web UI to work with.
+ 
+ Hadoop deployment with CloudFarmer was shown, and while HDFS came up, the 
JobTracker wasn't so happy. This led to a discussion on another problem in this 
world: debugging from log files in a world where the VMs can go away without 
much warning.
+ 
+ == Tim @last.fm: Hive ==
+ 
+ [[http://users.last.fm/~tims/20100310-Hive.pdf | Slides of Hive @ last.fm]]
+ 
+  * [[last.fm]] have been using Hive for 6 months
+  * Their cluster receivsd 600 events/second. On a par withTtwitter right now, 
but twitter "tweets" are growing faster and they have to do notifications
+  * Cluster: 44 nodes, each with 8 cores, 16 GB RAM and 4x1TB 7200 RPM storage 
(=704 GB RAM, 176 TB of storage, 352 cores).
+  * Charts: what's being played?
+  * Reporting: what they owe record companies?
+  * Corrections: cleaning up user supplied data. Most user data is pretty 
messy.
+  * Neighbours: finding similar users.
+  * Lots of queries about stuff -effectively a form of data warehousing.
+  * Hive tables can be one or more HDFS files;
+  * Hive also lets them import tables without bothering to pull into the HDFS 
filestore
+  * Some patches for various formats, Twitter did one for protocol buffers.
+  * No support for EchoIO; last.fm tried to do one and gave up, used Dumbo to 
import it into HDFS instead.
+  * External data is trusted more than anything else. Hive is not a database, 
just a query tool.
+  * Some queries can take minutes if they have to schedule MR jobs
+ Why Hive?
+  1. Developers have some familiarity with SQL, especially the web team that 
live off python. Make queries like how many people hit a page. Business people 
don't do MySQL, like the advertising team; they bring the questions to the 
developers who use the console.
+  1. Liked the ability to import from different sources
+  1. It worked, at the time they looked, Pig didn't.
+ Example: Rage against the machine vs Joe from X-factor.
+  * Query: how many users listen to music on the radio after they've scrobbled 
from their own collection?
+  * The #of users that scrobble is << #of users that use the radio, but 
scrobbling users generate lots more data.
+  * Hive "explain" provides the execution plan.
+ Example: "reach": how many people have listened to an artist?
+ Example: "popularity": how often it is listened to
+ Workflow: scrobbles  -> hive -> solr
+ === Weaknesses ===
+  * No recordio record support
+  * Big joins used to OOM, but this seems to have gone away. It used to join 
in RAM, fast but would OOM.
+  * Pig? Better for exploding data -cross products. Could be better for user 
functions
+  * After Last.fm upgraded to Cloudera 0.20 Hadoop, Hive would start jobs but 
not finish them. They didn't upgrade Hive at first. Recompiled Hive and 
eventually it went away. Cloudera's Hive release fixed this.
+  * The thrift server for hive stopped working as the Cloudera version pointed 
to a different lib which led to conflicts.
+ Question: Has anyone tried to do any Object-Relational mappings? no.
+ 
+ 
+ == Sanders van der Waal: Community Engagement ==
+ 
+ [[http://www.slideshare.net/steve_l/community-engagement-3460225|Slides]]
+ 
+ Sanders van der Waal from [[http://www.oss-watch.ac.uk/|OSSWatch]] gave a 
talk on Community Engagement. OSSWatch provide consultation and some support on 
Open Source in UK Higher Education and Universities, and have been getting 
involved in Hadoop in the past year as it makes a good platform for some 
scientific research, as well as a place for CS people to explore scheduling 
problems.
+ 
+ Sanders emphasised that there is no Open Source community other than that 
which the users choose to make themselves; he also looked at the benefits of 
local groups -face to face discussion- with the risks -you are restricted in 
your contacts, and the discussions tend not to be archived/searchable as per 
mailing lists and bug tracker issues.
+ 
+ There is a workshop in Oxford on June 24 & 25 on technology transfer, 
followed by a BarCamp on June 26; all are welcome.
+ 
+ Sanders talk triggered an interesting discussion on whether the Grid model 
had delivered on what it had promised, or not. The answer: some stuff got 
addressed, but some things (storage) had been ignored, and turned out to be 
rather important.
+

[Hadoop Wiki] Update of "BristolHadoopWorkshopSpring2010 " by SteveLoughran

Reply via email to