I have two files, A and D, each containing one (vectorId, vector) pair per line. |D| = 100,000, |A| = 1,000, and the vectors have 100 dimensions.
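For illustration, assume each line is a tab-separated vectorId followed by the vector as 100 comma-separated doubles (this exact encoding is just an assumption for the sketch at the end, which parses this format):

    vec_00042    0.013,0.870, ... ,0.442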
Now I want to execute the following:

    for eachItem in A:
        for eachElem in D:
            dot_product = eachItem * eachElem
            save(dot_product)

What I tried was to convert file D into a MapFile in (key = vectorId, value = vector) format and set up a Hadoop job with:

    inputFile       = A
    inputFileFormat = NLineInputFormat

Pseudocode for the map function:

    map(key = vectorId, value = myVector):
        open(MapFile containing all vectors of D)
        for eachElem in MapFile:
            dot_product = myVector * eachElem
            context.write(dot_product)
        close(MapFile containing all vectors of D)

I was expecting that sequentially accessing the MapFile would be much faster. When I took some stats on a single node with a smaller dataset, where |A| = 100 and |D| = 100,000, I observed:

    total time taken to iterate over the MapFile = 738 sec
    total time taken to compute the dot products =  11 sec

My original intention of speeding up the process with MapReduce is defeated by the I/O time involved in accessing each entry in the MapFile. Are there any other avenues that I could explore?
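For completeness, here is a minimal sketch of what that mapper could look like in concrete Hadoop Java. It is illustrative only: the class name DotProductMapper, the config key dotproduct.d.mapfile, and the assumption that the MapFile stores both key and value as Text in the line format shown above are my own choices, not part of the actual job.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DotProductMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        // Hypothetical config key pointing at the MapFile directory that holds D
        private static final String D_MAPFILE = "dotproduct.d.mapfile";

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // NLineInputFormat hands us (byte offset, one line of A);
            // assumed line format: vectorId<TAB>comma-separated components
            String[] parts = line.toString().split("\t", 2);
            String aId = parts[0];
            double[] aVec = parseVector(parts[1]);

            Configuration conf = context.getConfiguration();
            Path dDir = new Path(conf.get(D_MAPFILE));
            // Older-style Reader constructor; newer Hadoop versions also offer
            // new MapFile.Reader(dDir, conf)
            MapFile.Reader reader =
                    new MapFile.Reader(FileSystem.get(conf), dDir.toString(), conf);
            try {
                Text dId = new Text();
                Text dVec = new Text();
                // Stream sequentially through every (vectorId, vector) entry of D
                while (reader.next(dId, dVec)) {
                    double[] d = parseVector(dVec.toString());
                    double dot = 0.0;
                    for (int i = 0; i < aVec.length; i++) {
                        dot += aVec[i] * d[i];
                    }
                    // context.write needs a key as well as a value, so emit the
                    // (A-id, D-id) pair together with the dot product
                    context.write(new Text(aId + "," + dId), new DoubleWritable(dot));
                }
            } finally {
                reader.close();
            }
        }

        private static double[] parseVector(String s) {
            String[] tokens = s.split(",");
            double[] v = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                v[i] = Double.parseDouble(tokens[i]);
            }
            return v;
        }
    }

Even written out like this, every call to map() reopens D's MapFile and scans all 100,000 entries again, and that repeated iteration is where the measured 738 seconds go.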