Hi all,

I am currently processing a lot of raw CSV data and producing a summary text file, which I load into MySQL. On top of this I have a PHP application that generates tiles for Google Maps (sample tile: http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800). Here is a (dev server) example of the final map client: http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800. The dynamic grids shown as you zoom are all pre-calculated.
For better throughput (maps generate huge request volumes), I am considering pre-generating all my tiles as PNGs and storing them in S3 behind CloudFront. There will be billions of PNGs, each 1-3 KB.

Could someone please recommend the best place to generate the PNGs, and when to push them to S3, in a MapReduce system? If I did the PNG generation and the upload to S3 in the reduce, the same task running on multiple machines (e.g. speculative or retried attempts) would compete with each other, right? Should I instead generate the PNGs to a local directory and then push the lot up on task success? I am assuming that billions of 1-3 KB files on HDFS is not a good idea.

I will use EC2 for the MapReduce jobs for the time being, but this will later move to a local cluster that still pushes to S3...

Cheers,
Tim
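
For a sense of scale, here is a rough back-of-envelope sketch of the numbers above. It is only an estimate under stated assumptions: the tile count is taken as exactly one billion for illustration, and the figure of about 150 bytes of NameNode heap per HDFS object (with one file object plus one block object per small file) is a commonly quoted rule of thumb, not something measured on this cluster. The function names are made up for this sketch.

```python
# Back-of-envelope sizing for a pre-generated tile set.
# ASSUMPTIONS (not from the post): 1 billion tiles as an illustrative count,
# and ~150 bytes of NameNode heap per HDFS object, a commonly quoted
# rule of thumb. Each small file is counted as 2 objects (file + block).

KB = 1024  # bytes


def total_storage_bytes(tile_count, bytes_per_tile):
    """Raw payload size of all tiles, ignoring any per-object overhead."""
    return tile_count * bytes_per_tile


def namenode_heap_bytes(file_count, bytes_per_object=150, objects_per_file=2):
    """Rough NameNode heap needed to track file_count small files on HDFS."""
    return file_count * objects_per_file * bytes_per_object


if __name__ == "__main__":
    tiles = 1_000_000_000  # hypothetical: one billion tiles
    low = total_storage_bytes(tiles, 1 * KB)
    high = total_storage_bytes(tiles, 3 * KB)
    heap = namenode_heap_bytes(tiles)
    print(f"PNG payload: {low / 1e12:.2f} to {high / 1e12:.2f} TB per billion tiles")
    print(f"NameNode heap: about {heap / 1e9:.0f} GB per billion files")
```

Under these assumptions the PNG payload itself is modest (roughly 1 to 3 TB per billion tiles), while the HDFS metadata cost is on the order of 300 GB of NameNode heap per billion files, which is why keeping that many tiny files on HDFS looks like the wrong approach.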
