Thanks Todd and Chuck - sorry, my terminology was wrong... that's
exactly what I was looking for.

I am letting mysql churn through the zoom levels now to get some
final numbers on the tile counts and the S3 PUT cost.  It looks like
zoom level 8 is feasible for our current data volume, but not a
long-term option if the input data explodes in volume.
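
For reference, here is the back-of-envelope I am running while mysql
churns (a sketch only - the $0.01 per 1000 PUTs is the figure Kevin
quoted; the zoom depth, occupancy and taxon count are placeholders):

  // rough S3 PUT cost estimate; everything except the PUT price is a
  // placeholder until the real numbers come out of mysql
  public class S3PutCost {
    public static void main(String[] args) {
      int maxZoom = 8;          // deepest zoom level to pregenerate
      double occupancy = 0.05;  // assumed fraction of non-empty tiles
      long taxa = 1000000L;     // assumed number of taxa with maps
      long tilesPerTaxon = 0;
      for (int z = 0; z <= maxZoom; z++) {
        // a full pyramid has 4^z tiles at zoom z; sparsity means only a
        // fraction are non-empty (but always at least one)
        tilesPerTaxon += Math.max(1L, (long) (Math.pow(4, z) * occupancy));
      }
      long puts = tilesPerTaxon * taxa;
      System.out.println(puts + " PUTs -> $" + (puts / 1000.0 * 0.01));
    }
  }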

Cheers,

Tim



On Thu, Apr 16, 2009 at 9:05 PM, Chuck Lam <[email protected]> wrote:
> ah.. i totally missed the point you made about "competing reducers". it
> didn't occur to me that you were talking about hadoop's speculative
> execution. todd's solution of turning off speculative execution is correct.
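>
> in code it's a single switch on the job conf (this is the old JobConf
> api; MyTileJob is just a placeholder for your job class):
>
>   import org.apache.hadoop.mapred.JobConf;
>
>   JobConf conf = new JobConf(MyTileJob.class);
>   // stop hadoop launching backup copies of slow reduce tasks, so only
>   // one reduce attempt ever pushes a given tile to S3
>   conf.setReduceSpeculativeExecution(false);
>   // or in the config xml: mapred.reduce.tasks.speculative.execution=false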
>
> i'll respond to the rest of your email later today.
>
>
>
> On Thu, Apr 16, 2009 at 5:23 AM, tim robertson <[email protected]>
> wrote:
>>
>> Thanks Chuck,
>>
>> > I'm shooting for finishing the case studies by the end of May, but
>> > it'll be nice to have a draft done by mid-May so we can edit it to
>> > have a consistent style with the other case studies.
>>
>> I will do what I can!
>>
>> > I read your blog and found a couple posts on spatial joining. It
>> > wasn't clear to me from reading the posts whether the work was just
>> > experimental or if it led to some application. If it led to an
>> > application, then we may incorporate that into the case study too.
>>
>> It led to http://widgets.gbif.org/test/PACountry.html#/area/2571 which
>> shows a statistical summary of our data (latitude/longitude records)
>> cross-referenced with the polygons of the world's protected areas.  In
>> truth, though, we processed it in both PostGIS and Hadoop and found
>> that the PostGIS approach, while way slower, was fine for now, and we
>> developed the scripts for it more quickly.  So you can say it was
>> experimental... I do have ambitions to do a basic geospatial join
>> (points in polygons) for Pig, Cloudbase or Hive 2.0 but alas have not
>> found the time.  Also - the blog is always a late-Sunday-night effort,
>> so it is really not well written.
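>>
>> (For the curious, the core of the points-in-polygons test is plain ray
>> casting; a minimal Java sketch, nothing GBIF-specific and not our
>> production code:)
>>
>>   // standard ray-casting test: xs/ys are the polygon vertices in
>>   // order; counts edge crossings of a ray from the point (odd = inside)
>>   static boolean contains(double[] xs, double[] ys, double px, double py) {
>>     boolean inside = false;
>>     for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
>>       if ((ys[i] > py) != (ys[j] > py)
>>           && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
>>         inside = !inside;
>>       }
>>     }
>>     return inside;
>>   }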
>>
>> > BTW, where in the US are you traveling to? I'm in Silicon Valley, so
>> > maybe we can meet up if you'll happen to be in the area and can
>> > squeeze a little time out.
>>
>> Would have loved to... but I'm in Boston and DC this time.  In a few
>> weeks I will be in Chicago, but for some reason I have never made it
>> over to your neck of the woods.
>>
>> > I don't know what data you need to produce a single PNG file, so I don't
>> > know whether having map output TileId-ZoomLevel-SpeciesId as key is the
>> > right factoring. To me it looks like each PNG represents one tile at one
>> > zoom level but includes multiple species.
>>
>> We do individual species and higher levels of taxa (up to all data).
>> This is all the data, grouped into 1x1 degree cells (think 100x100 km)
>> with counts.  It is currently preprocessed with mysql, but it is
>> another hadoop candidate as we grow.
>>
>> http://maps.gbif.org/mapserver/draw.pl?dtype=box&imgonly=1&path=http%3A%2F%2Fdata.gbif.org%2Fmaplayer%2Ftaxon%2F13140803&extent=-180.0+-90.0+180.0+90.0&mode=browse&refresh=Refresh&layer=countryborders
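>>
>> (The cell id scheme is nothing fancy; roughly this, simplified from
>> what we actually run:)
>>
>>   // map a lat/lng to a 1x1 degree cell id in 0..64799 (180 rows x
>>   // 360 cols); real code must clamp lat=90 and lng=180 at the edges
>>   static int cellId(double lat, double lng) {
>>     int row = (int) Math.floor(lat + 90);   // 0..179
>>     int col = (int) Math.floor(lng + 180);  // 0..359
>>     return row * 360 + col;
>>   }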
>>
>> > In any case, under Hadoop/MapReduce, all key/value pairs outputted by
>> > the mappers are grouped by key before being sent to the reducer, so
>> > it's guaranteed that the same key will not go to multiple reducers.
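>> >
>> > By default this routing is done by HashPartitioner, which is a pure
>> > function of the key, so equal keys always land on the same reduce task:
>> >
>> >   // what Hadoop's default HashPartitioner computes per record
>> >   int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;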
>>
>> That is good to know.  I knew Map tasks would get run on multiple
>> machines if Hadoop detects a machine is idle, but I wasn't sure
>> whether Hadoop would also run competing reducers on different machines
>> and kill whichever one did not finish first.
>>
>> > You may also want to think more about the actual volume and cost of
>> > all this. You initially said that you will have "billions of PNGs
>> > produced each at 1-3KB" but then later said the data size is only a
>> > few 100GB due to sparsity. Either you're not really creating billions
>> > of PNGs, or a lot of them are actually less than 1KB. Kevin brought
>> > up a good point that S3 charges $0.01 for every 1000 files
>> > ("objects") created, so generating 1 billion files will already set
>> > you back $10K plus storage cost (and transfer cost if you're not
>> > using EC2).
>>
>> Right - my bad... Having not processed it all, I am not 100% sure yet
>> what the size will be or what zoom level I will preprocess to.
>> The challenge is that our data is growing continuously, so billions of
>> PNGs was a projection for the coming months.  Sorry for the
>> contradiction.
>>
>> You have clearly spotted that I am doing this as a side project
>> (evenings, really) and not devoting enough time to it!!!  By day I am
>> still on mysql and postgis, but I am hitting limits and looking at our
>> scalability.  I kind of overlooked the PUT cost on S3, stupidly
>> thinking that EC2->S3 was free (the transfer is, but the PUT requests
>> are not).
>>
>> I actually have this processed for individual species using mysql
>> (http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800) but not
>> for the higher groupings of species (families, etc.).  It could be
>> that I end up only processing all the summary data in Hadoop and then
>> loading it back into a light DB to render the maps in real time, like
>> the link I just provided.  Tiles render in around 150 msec, so with
>> some hardware we could probably scale....
>>
>> Thanks for your input - I appreciate it a lot since I'm working
>> mostly alone on the processing.
>>
>> Cheers,
>>
>> Tim
>>
>> >
>> >
>> >
>> > On Thu, Apr 16, 2009 at 1:27 AM, tim robertson
>> > <[email protected]>
>> > wrote:
>> >>
>> >> Hi Chuck,
>> >>
>> >> Thank you very much for this opportunity.  I also think it is a nice
>> >> case study; it goes beyond the typical wordcount example by
>> >> generating something that people can actually see and play with
>> >> immediately afterwards (e.g. maps).  It also nicely showcases the
>> >> community effort to collectively bring together information on the
>> >> world's biodiversity - the GBIF network really is a nice example of
>> >> a free and open-access community collectively addressing
>> >> interoperability globally.  Can you please tell me what time frame
>> >> you would need the case study in?
>> >>
>> >> I have just got my Java PNG generation code down to 130 msec on the
>> >> Mac, so I am pretty much ready to start running on EC2 and doing the
>> >> volume tile generation; I will blog the whole thing on
>> >> http://biodivertido.blogspot.com at some point soon.  I have to
>> >> travel to the US on Saturday for a week, so this will delay things
>> >> somewhat.
>> >>
>> >> What is not 100% clear to me is when to push to S3.
>> >> In the Map I will output TileId-ZoomLevel-SpeciesId as the key,
>> >> along with the count, and in the Reduce I group the counts into
>> >> larger tiles and create the PNG.  I could write to a SequenceFile
>> >> here... but I suspect I could also just push to the S3 bucket at
>> >> this point - as long as the task tracker does not send the same keys
>> >> to multiple reduce tasks - my Hadoop naivety is showing here (I
>> >> wrote an in-memory threaded MapReduceLite which does not run
>> >> competing reducers, but I have not got into the Hadoop code quite so
>> >> much yet).
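>> >>
>> >> Roughly what I have in mind for the reduce, as a sketch only (old
>> >> API; putToS3 is a stand-in for whatever S3 client I end up using,
>> >> probably JetS3t, and the actual painting of cells is elided):
>> >>
>> >>   import java.awt.image.BufferedImage;
>> >>   import java.io.ByteArrayOutputStream;
>> >>   import java.io.IOException;
>> >>   import java.util.Iterator;
>> >>   import javax.imageio.ImageIO;
>> >>   import org.apache.hadoop.io.IntWritable;
>> >>   import org.apache.hadoop.io.Text;
>> >>   import org.apache.hadoop.mapred.*;
>> >>
>> >>   // key = "tileId-zoom-speciesId"; values = counts for the cells
>> >>   // in this tile; one PNG is rendered and pushed per key
>> >>   public class TileReducer extends MapReduceBase
>> >>       implements Reducer<Text, IntWritable, Text, IntWritable> {
>> >>
>> >>     public void reduce(Text key, Iterator<IntWritable> values,
>> >>         OutputCollector<Text, IntWritable> output, Reporter reporter)
>> >>         throws IOException {
>> >>       BufferedImage tile =
>> >>           new BufferedImage(256, 256, BufferedImage.TYPE_INT_ARGB);
>> >>       while (values.hasNext()) {
>> >>         int count = values.next().get();
>> >>         // ... paint the cell(s) for this count onto the tile ...
>> >>       }
>> >>       ByteArrayOutputStream png = new ByteArrayOutputStream();
>> >>       ImageIO.write(tile, "png", png);
>> >>       putToS3(key.toString() + ".png", png.toByteArray());
>> >>       output.collect(key, new IntWritable(png.size())); // bookkeeping
>> >>     }
>> >>
>> >>     private void putToS3(String key, byte[] bytes) {
>> >>       // hypothetical helper - wire up the S3 client of choice here
>> >>     }
>> >>   }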
>> >>
>> >>
>> >> Cheers,
>> >>
>> >> Tim
>> >>
>> >>
>> >>
>> >> On Thu, Apr 16, 2009 at 1:49 AM, Chuck Lam <[email protected]> wrote:
>> >> > Hi Tim,
>> >> >
>> >> > I'm really interested in your application at gbif.org. I'm in the
>> >> > middle of writing Hadoop in Action ( http://www.manning.com/lam/ )
>> >> > and think this may make for an interesting hadoop case study,
>> >> > since you're taking advantage of a lot of different pieces (EC2,
>> >> > S3, cloudfront, SequenceFiles, PHP/streaming). Would you be
>> >> > interested in discussing making a 4-5 page case study out of this?
>> >> >
>> >> > As to your question, I don't know if it's been properly answered,
>> >> > but I don't know why you think that "multiple tasks are running on
>> >> > the same section of the sequence file." Maybe you can elaborate
>> >> > further and I'll see if I can offer any thoughts.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Apr 14, 2009 at 7:10 AM, tim robertson
>> >> > <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Sorry Brian, can I just ask please...
>> >> >>
>> >> >> I have the PNGs in a SequenceFile for my sample set.  If I use a
>> >> >> second MR job and push to S3 in the map, surely I run into the
>> >> >> scenario where multiple tasks run on the same section of the
>> >> >> SequenceFile and thus push the same data to S3.  Am I missing
>> >> >> something obvious (e.g. can I disable this behavior)?
>> >> >>
>> >> >> Cheers
>> >> >>
>> >> >> Tim
>> >> >>
>> >> >>
>> >> >> On Tue, Apr 14, 2009 at 2:44 PM, tim robertson
>> >> >> <[email protected]> wrote:
>> >> >> > Thanks Brian,
>> >> >> >
>> >> >> > This is pretty much what I was looking for.
>> >> >> >
>> >> >> > Your calculations are correct but based on the assumption that
>> >> >> > we will need all tiles generated at all zoom levels.  Given the
>> >> >> > sparsity of the data, it actually comes to only a few hundred
>> >> >> > GBs.  So I'll run a second MR job with the map pushing to S3,
>> >> >> > to make use of parallel loading.
>> >> >> >
>> >> >> > Cheers,
>> >> >> >
>> >> >> > Tim
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman
>> >> >> > <[email protected]>
>> >> >> > wrote:
>> >> >> >> Hey Tim,
>> >> >> >>
>> >> >> >> Why don't you put the PNGs in a SequenceFile in the output of
>> >> >> >> your reduce task?  You could then have a post-processing step
>> >> >> >> that unpacks the PNGs and places them onto S3.  (If my numbers
>> >> >> >> are correct, you're looking at around 3TB of data; is this
>> >> >> >> right?  With that much, you might want a separate Map task to
>> >> >> >> unpack all the files in parallel ... it really depends on the
>> >> >> >> throughput you get to Amazon)
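>> >> >> >>
>> >> >> >> (The unpack step is just the standard SequenceFile read loop;
>> >> >> >> something like this, where fs/path/conf are your FileSystem,
>> >> >> >> Path and Configuration, and upload() stands in for whatever S3
>> >> >> >> client call you use:)
>> >> >> >>
>> >> >> >>   import org.apache.hadoop.io.BytesWritable;
>> >> >> >>   import org.apache.hadoop.io.SequenceFile;
>> >> >> >>   import org.apache.hadoop.io.Text;
>> >> >> >>
>> >> >> >>   // iterate the (tile name, png bytes) pairs, pushing each to S3
>> >> >> >>   SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
>> >> >> >>   Text key = new Text();
>> >> >> >>   BytesWritable value = new BytesWritable();
>> >> >> >>   while (reader.next(key, value)) {
>> >> >> >>     upload(key.toString(), value.getBytes(), value.getLength());
>> >> >> >>   }
>> >> >> >>   reader.close();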
>> >> >> >>
>> >> >> >> Brian
>> >> >> >>
>> >> >> >> On Apr 14, 2009, at 4:35 AM, tim robertson wrote:
>> >> >> >>
>> >> >> >>> Hi all,
>> >> >> >>>
>> >> >> >>> I am currently processing a lot of raw CSV data and producing a
>> >> >> >>> summary text file which I load into mysql.  On top of this I
>> >> >> >>> have a
>> >> >> >>> PHP application to generate tiles for Google Maps (sample tile:
>> >> >> >>> tile:
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
>> >> >> >>> Here is a (dev server) example of the final map client:
>> >> >> >>> http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 -
>> >> >> >>> the
>> >> >> >>> dynamic grids as you zoom are all pre-calculated.
>> >> >> >>>
>> >> >> >>> I am considering (for better throughput, as maps generate
>> >> >> >>> huge request volumes) pregenerating all my tiles (PNGs) and
>> >> >> >>> storing them in S3 with CloudFront.  There will be billions
>> >> >> >>> of PNGs produced, each at 1-3KB.
>> >> >> >>>
>> >> >> >>> Could someone please recommend the best place to generate the
>> >> >> >>> PNGs
>> >> >> >>> and
>> >> >> >>> when to push them to S3 in a MR system?
>> >> >> >>> If I do the PNG generation and upload to S3 in the reduce,
>> >> >> >>> the same task running on multiple machines will compete with
>> >> >> >>> each other, right?  Should I generate the PNGs to a local
>> >> >> >>> directory and then push the lot up on task success?  I am
>> >> >> >>> assuming billions of 1-3KB files on HDFS is not a good idea.
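>> >> >> >>>
>> >> >> >>> (My rough namenode math for why I assume that - this uses the
>> >> >> >>> commonly cited ~150 bytes of namenode heap per file/block
>> >> >> >>> object, a ballpark figure, not a measurement:)
>> >> >> >>>
>> >> >> >>>   long files = 1000000000L;       // 1e9 tiny PNGs
>> >> >> >>>   long objects = files * 2;       // roughly 1 file + 1 block each
>> >> >> >>>   long heapBytes = objects * 150; // ~300 GB of namenode heap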
>> >> >> >>>
>> >> >> >>> I will use EC2 for the MR for the time being, but this will be
>> >> >> >>> moved
>> >> >> >>> to a local cluster still pushing to S3...
>> >> >> >>>
>> >> >> >>> Cheers,
>> >> >> >>>
>> >> >> >>> Tim
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
