Yes, we are leaning on the map-side join package quite heavily too - it is an excellent addition to the MapReduce model and is proving really useful. However, while HADOOP-5571 is the immediate problem for us, I expect we will want to join more than 64 files soon as well, especially if we move to larger clusters.
2009/3/25 jason hadoop <[email protected]>

> That code is highly optimized and quite difficult to follow. We have always
> limited our joins to 31 members and ignored the problem.
> But I think your jira and fixing it are the correct choices.
>
> There is, in my opinion, a decent write up on how to use map side joins in
> chapter 8 of my book, so I suspect more people will use this soon, as map
> side join is an incredibly powerful tool.
>
> In one of our production applications it took the run time from 5+ hours to
> about 12 minutes.
>
> On Wed, Mar 25, 2009 at 7:23 AM, Jingkei Ly <[email protected]> wrote:
>
> > Am I right in thinking that the CompositeInputFormat is limited to joining
> > 64 files?
> >
> > I believe this comes about because TupleWritable uses a single long-type
> > instance field in order to maintain a bitset of tuple slots that have been
> > written to - I'm guessing this is for performance reasons, but it also
> > implies that the TupleWritable only has 64 bits to play with when joining.
> >
> > If my assumptions above are true, could replacing this long with a
> > java.util.BitSet be appropriate in terms of making the map-side join package
> > more scalable?
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
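
To make the 64-slot concern concrete, here is a minimal Java sketch of the two approaches discussed above. It is not the actual TupleWritable code: the class names SlotTracking, LongSlots and BitSetSlots and their methods are hypothetical, made up purely for illustration. It relies only on standard Java behaviour: a shift on a long uses just the low 6 bits of the shift distance, so slot 64 silently aliases slot 0, whereas java.util.BitSet grows as needed.

```java
import java.util.BitSet;

public class SlotTracking {

    // Hypothetical stand-in for a long-backed "written" mask (not Hadoop's code).
    static final class LongSlots {
        private long written;

        void setWritten(int i) {
            // For a long, Java masks the shift distance to 6 bits,
            // so i = 64 behaves like i = 0.
            written |= 1L << i;
        }

        boolean has(int i) {
            return (written & (1L << i)) != 0;
        }
    }

    // Hypothetical BitSet-backed alternative; capacity grows on demand.
    static final class BitSetSlots {
        private final BitSet written;

        BitSetSlots(int expectedSlots) {
            written = new BitSet(expectedSlots);
        }

        void setWritten(int i) {
            written.set(i);
        }

        boolean has(int i) {
            return written.get(i);
        }
    }

    public static void main(String[] args) {
        LongSlots longSlots = new LongSlots();
        longSlots.setWritten(64); // intended: slot 64
        System.out.println("long-backed: slot 0 set? " + longSlots.has(0));

        BitSetSlots bitSetSlots = new BitSetSlots(128);
        bitSetSlots.setWritten(64);
        System.out.println("BitSet-backed: slot 0 set? " + bitSetSlots.has(0));
        System.out.println("BitSet-backed: slot 64 set? " + bitSetSlots.has(64));
    }
}
```

The trade-off is the one guessed at in the original question: a single long costs one primitive field and a couple of bit operations, while a BitSet adds an object and a backing array per tuple in exchange for removing the fixed limit. Whether that overhead matters for the join package would need measuring.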
