Hello, I have a question about the map-side join. I am trying to understand how it is implemented, and I would be grateful if you could tell me whether it involves any I/O cost. In detail: suppose we have 2 source files of 3 splits each (so that the constraints are satisfied, that is, sorted, partitioned, etc.). During a map-side join these 2 files are merged before the map function is called, and I am trying to understand how this merge is done. If I am not mistaken, one pair of corresponding splits is merged at a time; that is, split 1 of both sources is processed first.
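To make the question concrete, this is roughly the kind of merge I imagine for one pair of corresponding splits. It is only a sketch in Python under my own assumptions, not Hadoop's code: `merge_join` is a hypothetical name, each split is modelled as an iterator of (key, value) pairs already sorted by key, and only the values of the current key from one side are buffered in memory.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs, both sorted by key.

    Hypothetical sketch: records are pulled lazily from both inputs, and
    only the right-hand values of the key being joined are held in a list.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lkey, lgroup = next(left_groups, (None, None))
    rkey, rgroup = next(right_groups, (None, None))

    while lgroup is not None and rgroup is not None:
        if lkey < rkey:
            # Key only on the left so far: skip ahead on the left side.
            lkey, lgroup = next(left_groups, (None, None))
        elif lkey > rkey:
            # Key only on the right so far: skip ahead on the right side.
            rkey, rgroup = next(right_groups, (None, None))
        else:
            # Same key on both sides: buffer the right values for this key
            # only, then stream the left values against them.
            rvalues = [value for _, value in rgroup]
            for _, lvalue in lgroup:
                for rvalue in rvalues:
                    yield lkey, (lvalue, rvalue)
            lkey, lgroup = next(left_groups, (None, None))
            rkey, rgroup = next(right_groups, (None, None))


# Split 1 of each source, already sorted by key (made-up sample records).
split1_a = iter([(1, "a1"), (2, "a2"), (2, "a2b"), (4, "a4")])
split1_b = iter([(2, "b2"), (3, "b3"), (4, "b4"), (4, "b4b")])

for key, pair in merge_join(split1_a, split1_b):
    print(key, pair)
```

In this sketch neither split is ever fully loaded; the buffer grows only with the number of records that share one key. Whether the real implementation behaves like this is exactly what I would like to know.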
How is the merge of such a pair of splits carried out? Is it done on the fly (with an in-memory buffer)? Is any file created locally? I read the relevant details about the iterators, but I wonder about the memory requirements: if each split needs to be held in memory so that an iterator can run over it, then the join would need memory proportional to the split size. Thank you!

--
View this message in context: http://www.nabble.com/map-side-join-tp24722077p24722077.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
