Thanks Kevin, "... well, you're doing it wrong." This is what I'm afraid of :o)
I know the TaskTracker for the Maps, for example, can run on the same part of the input file, but I'm not so sure about the Reduce. In the reduce, will the same keys be run on multiple machines in competition?

On Thu, Apr 16, 2009 at 2:21 AM, Kevin Peterson <[email protected]> wrote:
> On Tue, Apr 14, 2009 at 2:35 AM, tim robertson
> <[email protected]> wrote:
>
>> I am considering (for better throughput as maps generate huge request
>> volumes) pregenerating all my tiles (PNG) and storing them in S3 with
>> CloudFront. There will be billions of PNGs produced, each at 1-3 KB.
>>
>
> Storing billions of PNGs at 1-3 KB each into S3 will be perfectly fine;
> there is no need to generate them and then push them all at once, if you are
> storing them each in their own S3 object (which they must be, if you intend
> to fetch them using CloudFront). Each S3 object is unique, and can be
> written fully in parallel. If you are writing to the same S3 object twice,
> ... well, you're doing it wrong.
>
> However, do the math on the costs for S3. We were doing something similar,
> and found that we were spending a fortune on our put requests at $0.01 per
> 1000, and next to nothing on storage. I've since moved to a more complicated
> model where I pack many small items in each object and store an index in
> SimpleDB. You'll need to partition your SimpleDBs if you do this.
>
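For anyone else doing the math Kevin suggests, here is a rough back-of-envelope. The $0.01 per 1000 PUTs figure is from his mail; the 1 billion tile count, 2 KB average size, and roughly $0.15/GB-month storage price are my own assumptions, not numbers from the thread:

    # Back-of-envelope: PUT cost vs. storage cost for billions of small tiles.
    num_tiles = 10 ** 9              # assumption: "billions of PNGs"
    avg_size_kb = 2                  # thread says 1-3 KB each
    put_cost_per_1000 = 0.01         # $0.01 per 1000 PUT requests (from the thread)
    storage_per_gb_month = 0.15      # assumed 2009-era S3 storage price, roughly

    put_cost = num_tiles / 1000.0 * put_cost_per_1000           # one-off -> $10,000
    storage_gb = num_tiles * avg_size_kb / (1024.0 * 1024.0)    # -> ~1900 GB
    storage_cost = storage_gb * storage_per_gb_month            # -> ~$290 per month

    print("PUT cost:     $%.0f (one-off)" % put_cost)
    print("Storage cost: $%.0f / month" % storage_cost)

So under these assumptions the PUT requests cost tens of times more than a month of storage, which matches Kevin's "fortune on puts, next to nothing on storage" experience.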

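And in case it is useful to anyone reading the archive, a minimal sketch of the pack-and-index approach Kevin describes. The function name, pack size, and index fields are my own guesses, not his actual code; a real version would write each pack blob to S3 and the index entries to (partitioned) SimpleDB domains:

    # Sketch: pack many small tiles into one blob per S3 object, plus an index.
    def pack_tiles(tiles, max_pack_bytes=8 * 1024 * 1024):
        """tiles: iterable of (tile_key, png_bytes).
        Yields (pack_blob, index_entries); each index entry records which pack a
        tile lives in and its byte range, so it can be fetched with a ranged GET."""
        pack_id = 0
        buf, entries, offset = [], [], 0
        for tile_key, data in tiles:
            # Start a new pack once the current one would exceed the size limit.
            if offset + len(data) > max_pack_bytes and buf:
                yield b"".join(buf), entries
                pack_id += 1
                buf, entries, offset = [], [], 0
            entries.append({"tile": tile_key,
                            "pack": "pack-%08d" % pack_id,   # S3 object key
                            "offset": offset,
                            "length": len(data)})
            buf.append(data)
            offset += len(data)
        if buf:
            yield b"".join(buf), entries

    # usage: for blob, index in pack_tiles(generated_tiles): upload blob, store index

Reading a tile back is then one index lookup for (pack, offset, length) followed by a ranged GET on the pack object, and the number of PUT requests drops by roughly the pack-size to tile-size ratio.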