One of my friends was asked this question on Hadoop MapReduce: we have multiple stores, and each store has many customers visiting and buying stuff. The dataset consists of "Store#, Customer#, Quantity purchased". We need MapReduce code to get the top 2 customers (by total quantity purchased) for each store.
The solution I first thought of was a secondary sort on qty in descending order (store + qty as the composite key), with the reducer emitting just the first 2 values (customers) for each store. That works if each customer appears only once, but if a customer has visited the same store multiple times, we first need to sum the quantities per customer and then take the top 2. The solutions I came up with:

SOLUTION 1
1. Key = Store + Customer.
2. Make sure all records for a store go to the same reducer (custom partitioner on store, plus grouping/key comparator for the secondary sort).
In the reducer:
3. Sum qty for each customer.
4. Add the result to a Map.
5. Loop through the Map to find the top 2 (or use a TreeMap to sort on the summed qty).

SOLUTION 2
Alternatively, write 2 MapReduce jobs that run one after the other: the first sums the qty purchased per (store, customer); the second sorts by qty and picks the top 2 buyers per store.

My questions:
- Is there any other way of achieving this?
- In Solution 1, if I use a Map and a TreeMap and loop through them, will it be a problem with millions of customers? Should I be worried about keeping all of them in memory?
- If the sorting/looping is really costly, what else would you advise?

Sample data:

STORE1,CUST1,100
STORE1,CUST2,10
STORE1,CUST1,40
STORE1,CUST2,27
STORE1,CUST3,63
STORE1,CUST1,33
STORE1,CUST2,12
STORE2,CUST4,24
STORE2,CUST6,234
STORE2,CUST4,3
STORE2,CUST5,34
STORE2,CUST6,234
STORE2,CUST4,234

Expected output:

STORE1,CUST1,173
STORE1,CUST3,63
STORE2,CUST6,468
STORE2,CUST4,261

Thanks!
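To make Solution 1 concrete, here is a minimal Python sketch (not actual Hadoop code) that simulates the shuffle grouping records by store and the reducer summing per customer and taking the top 2. The function name `top2_per_store` and the in-memory grouping are my own illustration, not part of any framework:

```python
import heapq
from collections import defaultdict

def top2_per_store(records, n=2):
    """Simulate Solution 1: group by store (as the shuffle would),
    sum qty per customer in the 'reducer', then keep only the top n."""
    by_store = defaultdict(list)
    for store, cust, qty in records:
        by_store[store].append((cust, qty))

    results = {}
    for store, pairs in by_store.items():
        # Reducer: sum quantities per customer...
        totals = defaultdict(int)
        for cust, qty in pairs:
            totals[cust] += qty
        # ...then take the top n without fully sorting all customers.
        results[store] = heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])
    return results

data = [
    ("STORE1", "CUST1", 100), ("STORE1", "CUST2", 10),
    ("STORE1", "CUST1", 40),  ("STORE1", "CUST2", 27),
    ("STORE1", "CUST3", 63),  ("STORE1", "CUST1", 33),
    ("STORE1", "CUST2", 12),  ("STORE2", "CUST4", 24),
    ("STORE2", "CUST6", 234), ("STORE2", "CUST4", 3),
    ("STORE2", "CUST5", 34),  ("STORE2", "CUST6", 234),
    ("STORE2", "CUST4", 234),
]
print(top2_per_store(data))
# {'STORE1': [('CUST1', 173), ('CUST3', 63)],
#  'STORE2': [('CUST6', 468), ('CUST4', 261)]}
```

This matches the expected output above, but note the `totals` map still holds every customer of a store in memory, which is exactly the concern in my question.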
