I have two files, A and D, each containing one (vectorId, vector) pair per line.
|D| = 100,000, |A| = 1,000, and the vectors have dimensionality 100.

Now I want to execute the following

for eachItem in A:
    for eachElem in D:
        dot_product = eachItem.vector * eachElem.vector
        save(eachItem.id, eachElem.id, dot_product)
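
For scale, that is 1,000 x 100,000 = 100 million dot products of
100-dimensional vectors. A single-machine baseline is small; here is a
rough sketch in Java (the tab-separated line format, the file names A.txt
and D.txt, and the save step are my assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class BruteForceBaseline {

    // Parse "vectorId<TAB>c1 c2 ... c100" lines into an id -> components map.
    static Map<String, double[]> load(String file) throws IOException {
        Map<String, double[]> vectors = new LinkedHashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\t");
                String[] comps = parts[1].split(" ");
                double[] v = new double[comps.length];
                for (int i = 0; i < comps.length; i++) {
                    v[i] = Double.parseDouble(comps[i]);
                }
                vectors.put(parts[0], v);
            }
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        Map<String, double[]> a = load("A.txt");   // 1,000 vectors
        Map<String, double[]> d = load("D.txt");   // 100,000 vectors
        for (Map.Entry<String, double[]> ea : a.entrySet()) {
            for (Map.Entry<String, double[]> ed : d.entrySet()) {
                double[] x = ea.getValue();
                double[] y = ed.getValue();
                double dot = 0.0;
                for (int i = 0; i < x.length; i++) {
                    dot += x[i] * y[i];
                }
                // save(ea.getKey(), ed.getKey(), dot), e.g. append to an output file
            }
        }
    }
}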


What I tried was to convert file D into a MapFile with (key = vectorId,
value = vector) entries and to set up a Hadoop job such that
inputFile = A
inputFileFormat = NLineInputFormat
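
Roughly, the conversion of D and the job driver look like this (a sketch
on the newer mapreduce API; the file names D.txt, A.txt, D.map, the
tab-separated line format, and the class names are my placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DotProductDriver {

    // Convert the text file D into a MapFile of (vectorId, vector) entries.
    // Note: MapFile.Writer requires keys to be appended in sorted order, so
    // D.txt is assumed to be sorted by vectorId. For simplicity this reads
    // D.txt from the local filesystem.
    static void convertD(Configuration conf, String dTextFile, String mapFileDir)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer writer =
                new MapFile.Writer(conf, fs, mapFileDir, Text.class, Text.class);
        try (BufferedReader r = new BufferedReader(new FileReader(dTextFile))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\t");   // vectorId <TAB> vector
                writer.append(new Text(parts[0]), new Text(parts[1]));
            }
        }
        writer.close();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        convertD(conf, "D.txt", "D.map");

        Job job = Job.getInstance(conf, "dot-products");
        job.setJarByClass(DotProductDriver.class);
        job.setMapperClass(DotProductMapper.class);   // mapper sketched further down
        job.setNumReduceTasks(0);                     // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // Each mapper gets a handful of lines (vectors) of A.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        FileInputFormat.addInputPath(job, new Path("A.txt"));
        FileOutputFormat.setOutputPath(job, new Path("dot-products-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}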

Pseudocode for the map function:

map(key=vectorId, value=myVector):
    open(MapFile containing all vectors of D)
    for eachElem in MapFile:
        dot_product = myVector * eachElem.vector
        context.write((vectorId, eachElem.id), dot_product)
    close(MapFile containing all vectors of D)
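
In Java that pseudocode corresponds to roughly the following (again
assuming each line/value has the form "vectorId<TAB>space-separated
components" and reusing the placeholder names from the driver sketch
above):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DotProductMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);

        // One line of A: "vectorId<TAB>c1 c2 ... c100"
        String[] parts = line.toString().split("\t");
        String aId = parts[0];
        double[] a = parse(parts[1]);

        // Stream sequentially through the MapFile holding all of D.
        MapFile.Reader reader = new MapFile.Reader(fs, "D.map", conf);
        Text dId = new Text();
        Text dVec = new Text();
        while (reader.next(dId, dVec)) {
            double dot = dot(a, parse(dVec.toString()));
            context.write(new Text(aId + "," + dId), new DoubleWritable(dot));
        }
        reader.close();
    }

    private static double[] parse(String s) {
        String[] comps = s.split(" ");
        double[] v = new double[comps.length];
        for (int i = 0; i < comps.length; i++) {
            v[i] = Double.parseDouble(comps[i]);
        }
        return v;
    }

    private static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }
}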


I was expecting that sequentially accessing the MapFile would be much
faster. When I gathered some stats on a single node with a smaller
dataset (|A| = 100, |D| = 100,000), what I observed was:

total time taken to iterate over the MapFile = 738 sec
total time taken to compute the dot products = 11 sec

My original intention of speeding up the process with MapReduce is
defeated by the I/O time involved in accessing each entry in the
MapFile. Are there any other avenues I could explore?
