On Thu, Jul 24, 2008 at 8:03 AM, Amber <[EMAIL PROTECTED]> wrote:
> Yes, I think this is the simplest method, but there are problems too:
>
> 1. The reduce stage wouldn't begin until the map stage ends, by which time
> we have already done two full table scans, and the comparing will take
> almost the same time again, because about 90% of intermediate <key, value>
> pairs will have two values and different keys. If I could specify a number
> n, so that the reduce tasks start once there are n intermediate pairs with
> the same key, that would be better. In my case I would set the magic number
> to 2.
I don't think I understood this completely, but I'll try to respond.

First, I think you're going to be doing something like two full table scans
in any case. Whether it's in an RDBMS or in Hadoop, you need to read the
complete dataset for both day1 and day2. (Or at least that's how I
interpreted your original mail - you're not trying to keep deltas over N
days, just doing a delta for yesterday/today from scratch every time.) You
could possibly speed this up by keeping some kind of parsed data in Hadoop
for previous days, rather than just text, but I wouldn't do this as my first
solution.

It seems like starting the reducers before the maps are done isn't going to
buy you anything. The same amount of total work needs to be done; when the
work starts doesn't matter much. In this case, I'm guessing that you're
going to have a setup where (total number of maps) == (total number of
reducers) == 4 * (number of 4-core machines).

In any case, I'd say you should do some experiments with the simplest
solution you can come up with. Your problem seems simple enough that just
banging out some throwaway experimental code is going to a) not take very
long, and b) tell you quite a bit about how your particular solution is
going to behave in the real world.

>
> 2. I am not sure about how Hadoop stores intermediate <key, value> pairs;
> we could not afford it as the data volume increases if it is kept in memory.

Hadoop is definitely prepared for very large numbers of intermediate
key/value pairs - that's pretty much the normal case for Hadoop jobs. It'll
stream to/from disk as necessary.

Take a look at combiners as well - they may buy you something.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
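
A minimal sketch of the kind of throwaway diff job being discussed might look
something like the following. This is just an illustration, not code from the
thread: it uses the old org.apache.hadoop.mapred API, it assumes the primary
key is the first tab-separated field of each text record (change that to match
the real layout), and all class names are made up.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class TableDiff {

      // Mapper: emit (primary key, whole record). Assumes the first
      // tab-separated field of each line is the primary key.
      public static class DiffMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          String record = line.toString();
          int tab = record.indexOf('\t');
          if (tab < 0) return;                      // skip malformed lines
          out.collect(new Text(record.substring(0, tab)), new Text(record));
        }
      }

      // Reducer: a key that arrives with exactly two identical records is
      // unchanged between the two days; anything else (one record, or two
      // differing records) is reported as a difference.
      public static class DiffReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          Set<String> distinct = new HashSet<String>();
          int count = 0;
          while (values.hasNext()) {
            distinct.add(values.next().toString());
            count++;
          }
          if (count != 2 || distinct.size() != 1) {
            // Emits the differing record(s) as a bracketed set, e.g. "[row]".
            out.collect(key, new Text(distinct.toString()));
          }
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(TableDiff.class);
        conf.setJobName("table-diff");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(DiffMapper.class);
        conf.setReducerClass(DiffReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // args[0], args[1]: the day1 and day2 directories; args[2]: output.
        FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));
        JobClient.runJob(conf);
      }
    }

Run it with the two days' directories as the first two arguments and an output
directory as the third; the paths and jar name would be whatever fits your
setup. As written it only reports which keys changed; to tell additions from
deletions you would also tag each record with the input it came from (the old
API exposes the map task's input file via the "map.input.file" job property,
if I remember right).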
