Hi Chuck,

Thank you very much for this opportunity.  I also think it makes a nice
case study; it goes beyond the typical word count example by generating
something that people can actually see and play with immediately
afterwards (e.g. maps).  It also showcases nicely the community effort
to collectively bring together information on the world's biodiversity
- the GBIF network really is a nice example of a free and open access
community collectively addressing interoperability globally.  Can you
please tell me what time frame you would need the case study in?

I have just got my Java PNG generation code down to 130 msec on the
Mac, so I am pretty much ready to start running on EC2 and doing the
volume tile generation; I will blog the whole thing on
http://biodivertido.blogspot.com at some point soon.  I have to travel
to the US on Saturday for a week, so this will delay it somewhat.

What is not 100% clear to me is when to push to S3:
In the Map I output the TileId-ZoomLevel-SpeciesId as the key, along
with the count, and in the Reduce I group the counts into larger tiles
and create the PNG.  I could write to a SequenceFile here... but I
suspect I could just push to the S3 bucket at this point as well - as
long as Hadoop does not send the same keys to multiple reduce tasks -
my Hadoop naivety is showing here (I wrote an in-memory threaded
MapReduceLite which does not run competing reducers, but I have not
got into the Hadoop code quite so much yet).
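To make the key scheme concrete, here is a minimal sketch of the map-side key computation, assuming standard Web Mercator / Google Maps tile math; the class name, method names, and exact key layout are illustrative, not taken from the actual GBIF code:

```java
// Hypothetical sketch of a TileId-ZoomLevel-SpeciesId key for an
// occurrence record, using standard Web Mercator tile math.
public class TileKey {

    /** Tile column for a longitude at a given zoom level. */
    public static int tileX(double lon, int zoom) {
        int n = 1 << zoom;                       // 2^zoom tiles per axis
        int x = (int) Math.floor((lon + 180.0) / 360.0 * n);
        return Math.min(Math.max(x, 0), n - 1);  // clamp edge case lon == 180
    }

    /** Tile row for a latitude at a given zoom level. */
    public static int tileY(double lat, int zoom) {
        int n = 1 << zoom;
        double latRad = Math.toRadians(lat);
        double y = (1.0 - Math.log(Math.tan(latRad)
                + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * n;
        return Math.min(Math.max((int) Math.floor(y), 0), n - 1);
    }

    /** Composite key in the TileId-ZoomLevel-SpeciesId shape. */
    public static String key(double lat, double lon, int zoom, long speciesId) {
        return tileX(lon, zoom) + "_" + tileY(lat, zoom)
                + "-" + zoom + "-" + speciesId;
    }

    public static void main(String[] args) {
        // A record at (0, 0) falls in tile (1, 1) at zoom 1.
        System.out.println(key(0.0, 0.0, 1, 13839800L));  // 1_1-1-13839800
    }
}
```

Because all records for one tile/zoom/species share a key, they all arrive at a single reduce call, which is what makes building the PNG in the Reduce possible.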


Cheers,

Tim



On Thu, Apr 16, 2009 at 1:49 AM, Chuck Lam <[email protected]> wrote:
> Hi Tim,
>
> I'm really interested in your application at gbif.org. I'm in the middle of
> writing Hadoop in Action ( http://www.manning.com/lam/ ) and think this may
> make for an interesting Hadoop case study, since you're taking advantage of
> a lot of different pieces (EC2, S3, CloudFront, SequenceFiles,
> PHP/streaming). Would you be interested in discussing making a 4-5 page case
> study out of this?
>
> As to your question, I don't know if it's been properly answered, but I'm
> not sure why you think that "multiple tasks are running on the same
> section of the sequence file." Maybe you can elaborate further and I'll see
> if I can offer any thoughts.
>
>
>
>
> On Tue, Apr 14, 2009 at 7:10 AM, tim robertson <[email protected]>
> wrote:
>>
>> Sorry Brian, can I just ask please...
>>
>> I have the PNGs in the SequenceFile for my sample set.  If I use a
>> second MR job and push to S3 in the map, surely I run into the
>> scenario where multiple tasks are running on the same section of the
>> SequenceFile and thus pushing the same data to S3.  Am I missing
>> something obvious (e.g. can I disable this behavior)?
>>
>> Cheers
>>
>> Tim
>>
>>
>> On Tue, Apr 14, 2009 at 2:44 PM, tim robertson
>> <[email protected]> wrote:
>> > Thanks Brian,
>> >
>> > This is pretty much what I was looking for.
>> >
>> > Your calculations are correct but based on the assumption that all
>> > tiles will need to be generated at all zoom levels.  Given the
>> > sparsity of the data, it actually results in only a few hundred GBs.
>> > I'll then run a second MR job with the map pushing to S3 to make use
>> > of parallel loading.
>> >
>> > Cheers,
>> >
>> > Tim
>> >
>> >
>> > On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman <[email protected]>
>> > wrote:
>> >> Hey Tim,
>> >>
>> >> Why don't you put the PNGs in a SequenceFile in the output of your
>> >> reduce
>> >> task?  You could then have a post-processing step that unpacks the PNG
>> >> and
>> >> places it onto S3.  (If my numbers are correct, you're looking at
>> >> around 3TB
>> >> of data; is this right?  With that much, you might want another
>> >> separate Map
>> >> task to unpack all the files in parallel ... really depends on the
>> >> throughput you get to Amazon)
>> >>
>> >> Brian
>> >>
>> >> On Apr 14, 2009, at 4:35 AM, tim robertson wrote:
>> >>
>> >>> Hi all,
>> >>>
>> >>> I am currently processing a lot of raw CSV data and producing a
>> >>> summary text file which I load into MySQL.  On top of this I have a
>> >>> PHP application to generate tiles for Google Maps (sample tile:
>> >>> http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
>> >>> Here is a (dev server) example of the final map client:
>> >>> http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
>> >>> dynamic grids as you zoom are all pre-calculated.
>> >>>
>> >>> I am considering (for better throughput, as maps generate huge
>> >>> request volumes) pre-generating all my tiles (PNG) and storing them
>> >>> in S3 with CloudFront.  There will be billions of PNGs produced,
>> >>> each 1-3KB.
>> >>>
>> >>> Could someone please recommend the best place to generate the PNGs
>> >>> and when to push them to S3 in a MR system?
>> >>> If I do the PNG generation and upload to S3 in the reduce, the same
>> >>> task running on multiple machines will compete with each other,
>> >>> right?  Should I generate the PNGs to a local directory and then,
>> >>> on task success, push the lot up?  I am assuming billions of 1-3KB
>> >>> files on HDFS is not a good idea.
>> >>>
>> >>> I will use EC2 for the MR for the time being, but this will be moved
>> >>> to a local cluster still pushing to S3...
>> >>>
>> >>> Cheers,
>> >>>
>> >>> Tim
>> >>
>> >>
>> >
>
>
