Sorry Brian, can I just ask one more thing, please? I have the PNGs in the SequenceFile for my sample set. If I use a second MR job and push to S3 in the map, won't I run into the scenario where multiple task attempts run over the same section of the SequenceFile and so push the same data to S3 more than once? Am I missing something obvious (e.g. can I disable this behavior)?
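To make the worry concrete, here is a minimal Python sketch (not Hadoop code). It assumes each tile's S3 key is derived deterministically from the tile itself; the dict stands in for a bucket, and tile_key, put_object and run_map_attempt are made-up names for illustration only. Two attempts over the same split upload everything twice, although the bucket ends up with one object per tile:

```python
# In-memory stand-in for an S3 bucket (key -> bytes). Not a real S3 client.
bucket = {}
uploads = []  # every PUT that was issued, in order


def tile_key(zoom, x, y, taxon_id):
    """Hypothetical key layout: the same tile always maps to the same key."""
    return f"tiles/{taxon_id}/{zoom}_{x}_{y}.png"


def put_object(key, body):
    """Record the upload and overwrite whatever is stored under the key."""
    uploads.append(key)
    bucket[key] = body


def run_map_attempt(split):
    """One attempt of a map task: push every tile in its split to the bucket."""
    for (zoom, x, y, taxon_id), png in split:
        put_object(tile_key(zoom, x, y, taxon_id), png)


# One section of the input, holding two tiles (payloads are placeholders).
split = [
    ((0, 0, 0, 13839800), b"png-bytes-a"),
    ((1, 0, 1, 13839800), b"png-bytes-b"),
]

run_map_attempt(split)  # first attempt
run_map_attempt(split)  # a second attempt over the same section

print(len(uploads), "uploads,", len(bucket), "objects stored")
```

Under that assumption the duplicate attempt costs extra requests and bandwidth rather than leaving extra objects behind, but I would still rather avoid it if I can.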
Cheers

Tim

On Tue, Apr 14, 2009 at 2:44 PM, tim robertson <[email protected]> wrote:
> Thanks Brian,
>
> This is pretty much what I was looking for.
>
> Your calculations are correct but based on the assumption that at all
> zoom levels we will need all tiles generated. Given the sparsity of
> data, it actually results in only a few 100GBs. I'll run a second MR
> job with the map pushing to S3 then to make use of parallel loading.
>
> Cheers,
>
> Tim
>
>
> On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman <[email protected]> wrote:
>> Hey Tim,
>>
>> Why don't you put the PNGs in a SequenceFile in the output of your reduce
>> task? You could then have a post-processing step that unpacks the PNG and
>> places it onto S3. (If my numbers are correct, you're looking at around 3TB
>> of data; is this right? With that much, you might want another separate Map
>> task to unpack all the files in parallel ... really depends on the
>> throughput you get to Amazon)
>>
>> Brian
>>
>> On Apr 14, 2009, at 4:35 AM, tim robertson wrote:
>>
>>> Hi all,
>>>
>>> I am currently processing a lot of raw CSV data and producing a
>>> summary text file which I load into mysql. On top of this I have a
>>> PHP application to generate tiles for google mapping (sample tile:
>>> http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
>>> Here is a (dev server) example of the final map client:
>>> http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
>>> dynamic grids as you zoom are all pre-calculated.
>>>
>>> I am considering (for better throughput as maps generate huge request
>>> volumes) pregenerating all my tiles (PNG) and storing them in S3 with
>>> cloudfront. There will be billions of PNGs produced each at 1-3KB
>>> each.
>>>
>>> Could someone please recommend the best place to generate the PNGs and
>>> when to push them to S3 in a MR system?
>>> If I did the PNG generation and upload to S3 in the reduce the same
>>> task on multiple machines will compete with each other right? Should
>>> I generate the PNGs to a local directory and then on Task success push
>>> the lot up? I am assuming billions of 1-3KB files on HDFS is not a
>>> good idea.
>>>
>>> I will use EC2 for the MR for the time being, but this will be moved
>>> to a local cluster still pushing to S3...
>>>
>>> Cheers,
>>>
>>> Tim
>>
>>
>
