That code is highly optimized and quite difficult to follow. We have always limited our joins to 31 members and ignored the problem. But I think filing your JIRA and fixing it are the correct choices.
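For anyone following the thread, here is a rough sketch of the limit raised in the question quoted below. It is not the actual TupleWritable code; the class names LongSlotTracker, BitSetSlotTracker and SlotTrackerDemo are made up for illustration. It only assumes, as the question does, that written slots are tracked as bits in a single long, and compares that with a java.util.BitSet.

```java
import java.util.BitSet;

// Hypothetical stand-in for a tuple that tracks written slots in one long.
final class LongSlotTracker {
    private long written; // bit i set => slot i has a value

    void setWritten(int slot) {
        // Java uses only the low 6 bits of the shift distance for a long,
        // so slot 64 silently aliases slot 0, slot 65 aliases slot 1, etc.
        written |= 1L << slot;
    }

    boolean has(int slot) {
        return (written & (1L << slot)) != 0;
    }
}

// Same idea backed by java.util.BitSet, which grows as needed.
final class BitSetSlotTracker {
    private final BitSet written = new BitSet();

    void setWritten(int slot) {
        written.set(slot);
    }

    boolean has(int slot) {
        return written.get(slot);
    }
}

final class SlotTrackerDemo {
    public static void main(String[] args) {
        LongSlotTracker fixed = new LongSlotTracker();
        fixed.setWritten(64);
        System.out.println(fixed.has(0));      // true: slot 64 aliased slot 0

        BitSetSlotTracker growable = new BitSetSlotTracker();
        growable.setWritten(64);
        System.out.println(growable.has(0));   // false
        System.out.println(growable.has(64));  // true
    }
}
```

The long version is cheaper per tuple, which is presumably why it would be chosen, but it fails silently past 64 slots rather than raising an error. The BitSet version trades some allocation and serialization overhead for an unbounded slot count.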
There is, in my opinion, a decent write-up on how to use map-side joins in chapter 8 of my book, so I suspect more people will use this soon, as the map-side join is an incredibly powerful tool. In one of our production applications it cut the run time from 5+ hours to about 12 minutes.

On Wed, Mar 25, 2009 at 7:23 AM, Jingkei Ly <[email protected]> wrote:

> Am I right in thinking that the CompositeInputFormat is limited to joining
> 64 files?
>
> I believe this comes about because TupleWritable uses a single long-type
> instance field in order to maintain a bitset of tuple slots that have been
> written to - I'm guessing this is for performance reasons, but it also
> implies that the TupleWritable only has 64 bits to play with when joining.
>
> If my assumptions above are true, could replacing this long with a
> java.util.BitSet be appropriate in terms of making the map-side join package
> more scalable?

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
