Thanks Todd and Chuck - sorry, my terminology was wrong... that's exactly what I was looking for.
I am letting mysql churn through the zoom levels now to get some final
numbers on the tiles and the cost of the S3 PUTs. Looks like zoom level
8 is feasible for our current data volume, but not a long-term option
if the input data explodes in volume.

Cheers,
Tim

On Thu, Apr 16, 2009 at 9:05 PM, Chuck Lam <[email protected]> wrote:
> ar.. i totally missed the point you made about "compete reducers". it
> didn't occur to me that you were talking about hadoop's speculative
> execution. todd's solution to turn off speculative execution is
> correct.
>
> i'll respond to the rest of your email later today.
>
>
> On Thu, Apr 16, 2009 at 5:23 AM, tim robertson
> <[email protected]> wrote:
>>
>> Thanks Chuck,
>>
>> > I'm shooting for finishing the case studies by the end of May, but
>> > it'll be nice to have a draft done by mid-May so we can edit it to
>> > have a consistent style with the other case studies.
>>
>> I will do what I can!
>>
>> > I read your blog and found a couple of posts on spatial joining.
>> > It wasn't clear to me from reading the posts whether the work was
>> > just experimental or whether it led to some application. If it led
>> > to an application, then we may incorporate that into the case
>> > study too.
>>
>> It led to http://widgets.gbif.org/test/PACountry.html#/area/2571
>> which shows a statistical summary of our data (latitude/longitude
>> records) cross-referenced with the polygons of the protected areas
>> of the world. In truth though, we processed it in both PostGIS and
>> Hadoop and found that the PostGIS approach, while way slower, was
>> fine for now, and we developed the scripts for it more quickly. So
>> you could say it was experimental... I do have ambitions to do a
>> basic geospatial join (points in polygons) for Pig, CloudBase or
>> Hive 2.0, but alas have not found the time. Also - the blog is
>> always a late-Sunday-night effort, so it really is not well written.
>>
>> > BTW, where in the US are you traveling to? I'm in Silicon Valley,
>> > so maybe we can meet up if you happen to be in the area and can
>> > squeeze a little time out.
>>
>> Would have loved to... but it's Boston and DC this time. In a few
>> weeks I will be in Chicago, but for some reason I have never made it
>> over to your neck of the woods.
>>
>> > I don't know what data you need to produce a single PNG file, so I
>> > don't know whether having the map output TileId-ZoomLevel-SpeciesId
>> > as the key is the right factoring. To me it looks like each PNG
>> > represents one tile at one zoom level but includes multiple
>> > species.
>>
>> We do individual species and higher levels of taxa (up to all data).
>> This is all the data, grouped into 1x1 degree cells (think 100x100
>> km) with counts. Currently preprocessed with MySQL, but another
>> Hadoop candidate as we grow.
>>
>> http://maps.gbif.org/mapserver/draw.pl?dtype=box&imgonly=1&path=http%3A%2F%2Fdata.gbif.org%2Fmaplayer%2Ftaxon%2F13140803&extent=-180.0+-90.0+180.0+90.0&mode=browse&refresh=Refresh&layer=countryborders
>>
>> > In any case, under Hadoop/MapReduce, all key/value pairs output by
>> > the mappers are grouped by key before being sent to the reducers,
>> > so it's guaranteed that the same key will not go to multiple
>> > reducers.
>>
>> That is good to know. I knew map tasks would get run on multiple
>> machines if Hadoop detects a machine is idle, but I wasn't sure
>> whether Hadoop would put reducers on machines to compete against
>> each other and kill the ones that did not finish first.
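
For the archives, here is what Todd's fix boils down to in code - a
minimal sketch using the old-style JobConf API (0.19-era; the driver
class name is just a placeholder):

    import org.apache.hadoop.mapred.JobConf;

    public class TileJobDriver {
      public static void main(String[] args) {
        JobConf conf = new JobConf(TileJobDriver.class);
        // Stop Hadoop launching duplicate "backup" attempts of slow
        // tasks, so a task that writes to S3 never runs twice in
        // parallel.
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);
        // Equivalent to setting mapred.map.tasks.speculative.execution
        // and mapred.reduce.tasks.speculative.execution to false.
      }
    }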
>>
>> > You may also want to think more about the actual volume and cost
>> > of all this. You initially said that you will have "billions of
>> > PNGs produced each at 1-3KB" but then later said the data size is
>> > only a few 100GB due to sparsity. Either you're not really
>> > creating billions of PNGs, or a lot of them are actually less than
>> > 1KB. Kevin brought up a good point that S3 charges $0.01 for every
>> > 1000 files ("objects") created, so generating 1 billion files will
>> > already set you back $10K plus storage cost (and transfer cost if
>> > you're not using EC2).
>>
>> Right - my bad... Having not processed it all yet, I am not 100%
>> sure what the size will be or to what zoom level I will preprocess.
>> The challenge is that our data is growing continuously, so "billions
>> of PNGs" was looking ahead to the coming months. Sorry for the
>> contradiction.
>>
>> You have clearly spotted that I am doing this as a project on the
>> side (evenings really) and not devoting enough time to it!!! By day
>> I am still on MySQL and PostGIS, but I am hitting their limits and
>> looking to our scalability. I kind of overlooked the PUT cost on S3,
>> thinking stupidly that EC2->S3 was free.
>>
>> I actually have the stuff processed for species only using MySQL
>> (http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800) but
>> not the higher groupings of species (families of species etc). It
>> could be that I end up only processing all the summary data in
>> Hadoop and then loading it back into a light DB to render the maps
>> in real time, like the link I just provided. Tiles render in around
>> 150 msec, so with some hardware we could probably scale...
>>
>> Thanks for your input - I appreciate it a lot, since I'm working
>> mostly alone on the processing.
>>
>> Cheers,
>>
>> Tim
>>
>> >
>> > On Thu, Apr 16, 2009 at 1:27 AM, tim robertson
>> > <[email protected]> wrote:
>> >>
>> >> Hi Chuck,
>> >>
>> >> Thank you very much for this opportunity. I also think it is a
>> >> nice case study; it goes beyond the typical wordcount example by
>> >> generating something that people can actually see and play with
>> >> immediately afterwards (e.g. maps). It also nicely showcases the
>> >> community effort to collectively bring together information on
>> >> the world's biodiversity - the GBIF network really is a nice
>> >> example of a free and open-access community collectively
>> >> addressing interoperability globally. Can you please tell me what
>> >> time frame you would need the case study in?
>> >>
>> >> I have just got my Java PNG generation code down to 130 msec on
>> >> the Mac, so I am pretty much ready to start running on EC2 and do
>> >> the volume tile generation; I will blog the whole thing on
>> >> http://biodivertido.blogspot.com at some point soon. I have to
>> >> travel to the US on Saturday for a week, so this will delay it
>> >> somewhat.
>> >>
>> >> What is not 100% clear to me is when to push to S3. In the map I
>> >> will output TileId-ZoomLevel-SpeciesId as the key, along with the
>> >> count, and in the reduce I group the counts into larger tiles and
>> >> create the PNG. I could write to a SequenceFile here... but I
>> >> suspect I could just push to the S3 bucket here as well - as long
>> >> as the task tracker does not send the same keys to multiple
>> >> reduce tasks. My Hadoop naivety is showing here (I wrote an
>> >> in-memory threaded MapReduceLite which does not compete reducers,
>> >> but I have not got into the Hadoop code quite so much yet).
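
To make that last paragraph concrete, this is roughly the reduce I
have in mind - a sketch against the old-style Hadoop API only.
renderPng() stands in for my real PNG code, and I have collapsed the
per-cell detail into a single total to keep it short; in reality the
values would carry the cell offsets within the tile:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Key is "TileId-ZoomLevel-SpeciesId"; values are the counts the
    // map emitted. Emitting the PNG as BytesWritable means
    // SequenceFileOutputFormat packs billions of small images into a
    // handful of large files instead of billions of HDFS files.
    public class TileReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, BytesWritable> {

      public void reduce(Text tileKey, Iterator<IntWritable> counts,
          OutputCollector<Text, BytesWritable> out, Reporter reporter)
          throws IOException {
        int total = 0;
        while (counts.hasNext()) {
          total += counts.next().get();  // aggregate counts for this tile
        }
        out.collect(tileKey,
            new BytesWritable(renderPng(tileKey.toString(), total)));
      }

      private byte[] renderPng(String tileKey, int count) {
        // Stand-in for the real ~130 msec PNG generation.
        return new byte[0];
      }
    }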
>> >>
>> >> Cheers,
>> >>
>> >> Tim
>> >>
>> >>
>> >> On Thu, Apr 16, 2009 at 1:49 AM, Chuck Lam <[email protected]>
>> >> wrote:
>> >> > Hi Tim,
>> >> >
>> >> > I'm really interested in your application at gbif.org. I'm in
>> >> > the middle of writing Hadoop in Action
>> >> > ( http://www.manning.com/lam/ ) and think this may make for an
>> >> > interesting hadoop case study, since you're taking advantage of
>> >> > a lot of different pieces (EC2, S3, CloudFront, SequenceFiles,
>> >> > PHP/streaming). Would you be interested in discussing making a
>> >> > 4-5 page case study out of this?
>> >> >
>> >> > As to your question, I don't know if it's been properly
>> >> > answered, but I don't know why you think that "multiple tasks
>> >> > are running on the same section of the sequence file." Maybe
>> >> > you can elaborate further and I'll see if I can offer any
>> >> > thoughts.
>> >> >
>> >> >
>> >> > On Tue, Apr 14, 2009 at 7:10 AM, tim robertson
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> Sorry Brian, can I just ask please...
>> >> >>
>> >> >> I have the PNGs in the SequenceFile for my sample set. If I
>> >> >> use a second MR job and push to S3 in the map, surely I run
>> >> >> into the scenario where multiple tasks are running on the same
>> >> >> section of the sequence file and thus pushing the same data to
>> >> >> S3. Am I missing something obvious (e.g. can I disable this
>> >> >> behavior)?
>> >> >>
>> >> >> Cheers
>> >> >>
>> >> >> Tim
>> >> >>
>> >> >>
>> >> >> On Tue, Apr 14, 2009 at 2:44 PM, tim robertson
>> >> >> <[email protected]> wrote:
>> >> >> > Thanks Brian,
>> >> >> >
>> >> >> > This is pretty much what I was looking for.
>> >> >> >
>> >> >> > Your calculations are correct, but based on the assumption
>> >> >> > that we will need all tiles generated at all zoom levels.
>> >> >> > Given the sparsity of the data, it actually results in only
>> >> >> > a few 100GBs. I'll run a second MR job with the map pushing
>> >> >> > to S3 then, to make use of parallel loading.
>> >> >> >
>> >> >> > Cheers,
>> >> >> >
>> >> >> > Tim
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman
>> >> >> > <[email protected]> wrote:
>> >> >> >> Hey Tim,
>> >> >> >>
>> >> >> >> Why don't you put the PNGs in a SequenceFile in the output
>> >> >> >> of your reduce task? You could then have a post-processing
>> >> >> >> step that unpacks the PNGs and places them onto S3. (If my
>> >> >> >> numbers are correct, you're looking at around 3TB of data;
>> >> >> >> is this right? With that much, you might want another
>> >> >> >> separate map task to unpack all the files in parallel... it
>> >> >> >> really depends on the throughput you get to Amazon.)
>> >> >> >>
>> >> >> >> Brian
>> >> >> >>
>> >> >> >> On Apr 14, 2009, at 4:35 AM, tim robertson wrote:
>> >> >> >>
>> >> >> >>> Hi all,
>> >> >> >>>
>> >> >> >>> I am currently processing a lot of raw CSV data and
>> >> >> >>> producing a summary text file which I load into MySQL. On
>> >> >> >>> top of this I have a PHP application to generate tiles for
>> >> >> >>> Google mapping (sample tile:
>> >> >> >>> http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
>> >> >> >>> Here is a (dev server) example of the final map client:
>> >> >> >>> http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800
>> >> >> >>> - the dynamic grids as you zoom are all pre-calculated.
>> >> >> >>>
>> >> >> >>> I am considering (for better throughput, as maps generate
>> >> >> >>> huge request volumes) pregenerating all my tiles (PNG) and
>> >> >> >>> storing them in S3 with CloudFront. There will be billions
>> >> >> >>> of PNGs produced, each at 1-3KB.
>> >> >> >>>
>> >> >> >>> Could someone please recommend the best place to generate
>> >> >> >>> the PNGs and when to push them to S3 in an MR system? If I
>> >> >> >>> did the PNG generation and upload to S3 in the reduce, the
>> >> >> >>> same task on multiple machines would compete with each
>> >> >> >>> other, right? Should I generate the PNGs to a local
>> >> >> >>> directory and then, on task success, push the lot up? I am
>> >> >> >>> assuming billions of 1-3KB files on HDFS is not a good
>> >> >> >>> idea.
>> >> >> >>>
>> >> >> >>> I will use EC2 for the MR for the time being, but this
>> >> >> >>> will be moved to a local cluster, still pushing to S3...
>> >> >> >>>
>> >> >> >>> Cheers,
>> >> >> >>>
>> >> >> >>> Tim
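
For completeness, a sketch of the second, map-only job Brian suggests:
it reads the (tile key, PNG bytes) pairs back out of the SequenceFile
and PUTs each one to S3. I am assuming the JetS3t client here (method
names from memory, so treat the S3 calls as an assumption), the
property names are made up, and the job would run with speculative
execution off and NullOutputFormat since the map emits nothing:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.jets3t.service.S3Service;
    import org.jets3t.service.S3ServiceException;
    import org.jets3t.service.impl.rest.httpclient.RestS3Service;
    import org.jets3t.service.model.S3Object;
    import org.jets3t.service.security.AWSCredentials;

    // Map-only job: each map task uploads its slice of the
    // SequenceFile, so the PUTs run in parallel across the cluster.
    public class S3UploadMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, Text> {

      private S3Service s3;
      private String bucket;

      public void configure(JobConf job) {
        try {
          s3 = new RestS3Service(new AWSCredentials(
              job.get("tiles.s3.access.key"),
              job.get("tiles.s3.secret.key")));
          bucket = job.get("tiles.s3.bucket");
        } catch (S3ServiceException e) {
          throw new RuntimeException("Could not connect to S3", e);
        }
      }

      public void map(Text tileKey, BytesWritable png,
          OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // BytesWritable's backing array is padded, so copy out only
        // the valid bytes.
        byte[] bytes = new byte[png.getLength()];
        System.arraycopy(png.getBytes(), 0, bytes, 0, png.getLength());

        try {
          S3Object tile = new S3Object(tileKey.toString() + ".png");
          tile.setDataInputStream(new ByteArrayInputStream(bytes));
          tile.setContentLength(bytes.length);
          tile.setContentType("image/png");
          s3.putObject(bucket, tile);
        } catch (S3ServiceException e) {
          throw new IOException("S3 PUT failed for " + tileKey + ": " + e);
        }
        reporter.incrCounter("tiles", "s3-puts", 1);
      }
    }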
