Hi Wouter, funny think, I had the same exact problem and we were thinking about this issue. The idea here is that this observation for strings might not be always true, and it is a situation that cannot be always determined on the kernel level. Correct me if I am wrong, but your benefit on query comes because the hash in the large BAT is already there, that's why the second time you get 0.01? You mention hot run so I assume the BAT is already there with a hash index. While in the original situation the hash is on the small BAT thus you don't benefit from the hot run. But if a big BAT of strings is to be used again it is unknown in the gdk level. So, I solved the problem by forcing the hash index on the big BAT in a higher level (in Monet5) where it knows something more about the application (in my case RDF store). Can you do instead that? force the hash index in a higher level for you application? If gdk see a hash index already there, then it will choose that independent of the size.
lefteris On Fri, Dec 18, 2009 at 4:22 PM, Wouter Alink <[email protected]> wrote: > Dear developers, > > I would like to propose a change in GDK and hear opinions. It is about > the following issue: > > in the BATjoin code, if there is no possibility to do a fetch or merge > join, a hashjoin is performed. A hashtable is created for the smallest > BAT. The reasons (i could think of) for choosing the smallest BAT for > the hashtable are that less space is required for the hashtable (which > in turn causes less cache misses when doing a lookup) and also because > the hashfunction used is assumed to be very inexpensive (it needs to > be calculated for each item in the large bat each time a join is > performed). > I can see that the hashfunction can be very efficient for data types > without indirection, but I feel that for data types like strings in > some cases this is a little different. If a string BAT for example > contains many different values (i.e. is not a bat which contains > enumeration values) the hashfunction will not be inexpensive anymore > (many cache misses), as each hashfunction call needs to hash a whole > (arbitrary long) string at an arbitrary place in the heap. > > Is it perhaps possible to specify that, when a BAT of type 'str' has > many different values a hashtable may be build on the large BAT > instead of on the small BAT? > > Reason that I ask this: I was analysing costs of a query in which I > had a few short strings (26 tuples, 1-column table: varchar) which I > wanted to look up in a dictionary (9M tuples, 2-column table: > int,varchar). "SELECT a.id FROM longlist AS a JOIN smalllist as b ON > a.strvalue=b.strvalue;" > The result is a small list of integers (26 or less tuples). This > operation currently takes roughly 1.5 seconds for a hot run, mostly > due to 9M strHash operations. By applying the patch below the > execution time for a hot run dropped down to .01 seconds. The > performance gain is caused by only having to perform strHash on the > items in the small bat once the hashtable for the large bat has been > created. > > Any suggestions whether such a change is useful? Which benchmarks will > be influenced? > > I guess this code change is probably not useful for large string BATs > with only few different values, but perhaps a guess could be made how > diverse the strings in a bat are (by taking a sample or perhaps simply > by looking at the ratio batsize/heapsize), and based on that determine > whether to build it on the large or small BAT? > > Greetings, > Wouter > > > Index: src/gdk/gdk_relop.mx > =================================================================== > RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_relop.mx,v > retrieving revision 1.167.2.4 > diff -u -r1.167.2.4 gdk_relop.mx > --- src/gdk/gdk_relop.mx 20 Nov 2009 13:04:06 -0000 1.167.2.4 > +++ src/gdk/gdk_relop.mx 18 Dec 2009 14:59:13 -0000 > @@ -1232,7 +1232,12 @@ > �...@- > hash join: the bread&butter join of monet > �...@c > - /* Simple rule, always build hash on the smallest */ > + /* Simple rule, always build hash on the smallest, > + except when it is a string-join, then we do the opposite */ > + if (swap && rcount < lcount && l->ttype == TYPE_str) { > + ALGODEBUG THRprintf(GDKout, "#BATjoin: > BATmirror(BAThashjoin(BATmirror(r), BATmirror(l)," BUNFMT "));\n", > estimate); > + return BATmirror(BAThashjoin(BATmirror(r), > BATmirror(l), estimate)); > + } > if (swap && rcount > lcount) { > ALGODEBUG THRprintf(GDKout, "#BATjoin: > BATmirror(BAThashjoin(BATmirror(r), BATmirror(l)," BUNFMT "));\n", > estimate); > > ------------------------------------------------------------------------------ > This SF.Net email is sponsored by the Verizon Developer Community > Take advantage of Verizon's best-in-class app development support > A streamlined, 14 day to market process makes app distribution fast and easy > Join now and get one step closer to millions of Verizon customers > http://p.sf.net/sfu/verizon-dev2dev > _______________________________________________ > Monetdb-developers mailing list > [email protected] > https://lists.sourceforge.net/lists/listinfo/monetdb-developers > ------------------------------------------------------------------------------ This SF.Net email is sponsored by the Verizon Developer Community Take advantage of Verizon's best-in-class app development support A streamlined, 14 day to market process makes app distribution fast and easy Join now and get one step closer to millions of Verizon customers http://p.sf.net/sfu/verizon-dev2dev _______________________________________________ Monetdb-developers mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/monetdb-developers
