Update of /cvsroot/monetdb/MonetDB5/src/modules/mal
In directory sc8-pr-cvs16.sourceforge.net:/tmp/cvs-serv15857/src/modules/mal

Modified Files:
        batxml.mx radix.mx 
Log Message:
propagated changes of Sunday Feb 03 2008 - Friday Feb 08 2008
from the MonetDB_5-4 branch to the development trunk


Index: radix.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB5/src/modules/mal/radix.mx,v
retrieving revision 1.15
retrieving revision 1.16
diff -u -d -r1.15 -r1.16
--- radix.mx    11 Jan 2008 10:41:38 -0000      1.15
+++ radix.mx    8 Feb 2008 22:36:34 -0000       1.16
@@ -28,7 +28,6 @@
 are applied in Monet's join processing strategy.
 
 @+ The Memory Story
[EMAIL PROTECTED]
 Computer RAM carries the acronym Random Access Memory, indicating that
 the memory access speed is independent of memory location. While this
 is still (mostly) true, the imbalance in speed improvements between CPU
@@ -43,7 +42,7 @@
 and a L2 (level-two) cache (typical size 256-2MB, typical latency 10-30ns). 
The L1
 and sometimes even the L2 are now located on the CPU chip itself in order to 
reduce latency.
 
-Cache memories are organized in {\em cache lines} of a fixed width. A typical 
width for
+Cache memories are organized in @emph{cache lines} of a fixed width. A typical 
width for
 an L1 line is 32 bytes, whereas a line L2 can be 32-128 bytes long. A cache 
line
 is the smallest unit of transfer, which means that on a miss, the memory 
system fetches
 all bytes of the line from the lower levels of the memory hierarchy in one go.
@@ -52,14 +51,14 @@
 as the different bytes of a cache line arrive in a sequential burst, but the 
total difference
 in arrival-time tends to be no more than 4 cycles apart from first to last 
byte.  This design of
 the memory caches (on all levels, but with varying size and line width 
parameters) has consequences
-for application performance, whose severity depend on what kind of {\em memory 
access patterns}
+for application performance, whose severity depends on what kind of @emph{memory access patterns}
 the application exhibits. Reading (or writing) a memory location that is not in the cache causes
 a miss, and the CPU is forced to wait for the latency period. Obviously, subsequent
 reads of this exact same data will not cause further cache misses and will be fast, because
 this data is already in cache.  But what happens with a sequential access 
pattern to data that
 is not in the cache? Again, the first read causes a cache miss. However, 
subsequent reads
 to adjacent bytes access the same cache line that is already loaded and 
therefore
-do {\em not} cause any cache misses. This is the reason why sequential memory 
access
+do @emph{not} cause any cache misses. This is the reason why sequential memory 
access
 is cheaper than random access, as the latter pattern may cause a cache miss on 
every
 read (considering the access is to an uncached memory region).
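To make the cost difference concrete, here is a minimal C sketch (an editorial illustration, not part of this module; the buffer size and the index scramble are arbitrary choices): it touches a buffer far larger than any cache once sequentially and once through a scrambled index array, and on typical hardware the second loop is several times slower purely because of cache misses.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024 / sizeof(long))   /* 64MB of longs, well beyond L2 */

    int main(void)
    {
        long *a = malloc(N * sizeof(long));
        size_t *idx = malloc(N * sizeof(size_t));
        long sum = 0;

        for (size_t i = 0; i < N; i++) {
            a[i] = (long)i;
            idx[i] = (i * 2654435761UL) % N;      /* scrambled positions */
        }

        clock_t t0 = clock();
        for (size_t i = 0; i < N; i++)
            sum += a[i];                          /* sequential: ~1 miss per cache line */
        clock_t t1 = clock();
        for (size_t i = 0; i < N; i++)
            sum += a[idx[i]];                     /* random: potentially a miss per read */
        clock_t t2 = clock();

        printf("sequential %.2fs, random %.2fs (sum=%ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        free(a);
        free(idx);
        return 0;
    }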
 
@@ -73,7 +72,6 @@
 their time on memory cache misses.
 
 @- Optimizing Memory Access during Join
[EMAIL PROTECTED]
 We focus here on improving the memory access performance of the join
 operator in order to gain performance. This is relevant, because the
 most popular main-memory join algorithm is hash-join, which exhibits a
@@ -94,16 +92,18 @@
 the access pattern of the join operation, in such a way that the overall
 number of cache misses is reduced. This is typically achieved by doing
 extra CPU work and/or by making additional sequential passes over the
-memory. In particular, the {\em partitioned-join} strategy first partitions
+memory. In particular, the @emph{partitioned-join} strategy first partitions
 (or clusters) both inner and outer relations into small clusters, such that
 each cluster fits the smallest memory cache. The partitioned-join then only
 has to combine tuples from corresponding clusters. This module provides two
 partitioned-join algorithms:
-\begin{itemize}
-\item the {\em phash join} algorithm that performs hash-join on the matching 
clusters.
-\item as an alternative, we provide {\em radix join} that performs nested loop
+@itemize
+@item
+the @emph{phash join} algorithm that performs hash-join on the matching 
clusters.
+@item
+as an alternative, we provide @emph{radix join} that performs nested loop
 join on the matching clusters.
-\end{itemize}
+@end itemize
 Phash join needs clusters such that a cluster of the inner relation plus its hash
 table fits the smallest cache. If we assume that tuples in the inner relation
 (plus hash table) occupy 16 bytes and the L1 cache is 16KB, we need clusters
@@ -120,10 +120,10 @@
 the cluster operation itself will start to generate a huge number of cache 
misses.
 This reduces the efficiency of partitioned join.
 
-The {\em radix-cluster} algorithm proposed
+The @emph{radix-cluster} algorithm proposed
 here solves this problem by making multiple passes; in each pass the relation 
is
 subclustered on a number of radix-bits.  Radix-bits are a subsequence of bits 
taken
-from a {\em radix-number}, which usually is the integer result of hashing the 
value on which
+from a @emph{radix-number}, which usually is the integer result of hashing the 
value on which
 we want to cluster. The first pass of the radix-cluster algorithm puts all 
tuples with an
 equal bit-pattern in the higher H1 radix-bits together in a cluster. The 
second pass starts
 where the previous left off, and subdivides each cluster on the second-highest 
H2 bits.
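As a rough, hedged illustration of the multi-pass idea (this is not the module's buffered one-scan implementation; it assumes 32-bit keys whose significant bits all lie in the lowest total_bits positions, with h1 + h2 <= total_bits, and uses a plain count-then-scatter pass):

    #include <stdlib.h>
    #include <stdint.h>

    /* one clustering pass on 'bits' radix-bits taken at offset 'shift':
       histogram, prefix sums, scatter */
    static void radix_pass(const uint32_t *in, uint32_t *out, size_t n,
                           int shift, int bits)
    {
        size_t nclust = (size_t)1 << bits;
        size_t *cnt = calloc(nclust + 1, sizeof(size_t));
        uint32_t mask = (uint32_t)nclust - 1;

        for (size_t i = 0; i < n; i++)               /* histogram */
            cnt[((in[i] >> shift) & mask) + 1]++;
        for (size_t i = 1; i <= nclust; i++)         /* prefix sums = cluster starts */
            cnt[i] += cnt[i - 1];
        for (size_t i = 0; i < n; i++)               /* scatter into clusters */
            out[cnt[(in[i] >> shift) & mask]++] = in[i];
        free(cnt);
    }

    /* pass 1 on the highest h1 bits, pass 2 on the next h2 bits within each cluster */
    void radix_cluster_2pass(uint32_t *keys, size_t n, int total_bits, int h1, int h2)
    {
        uint32_t *tmp = malloc(n * sizeof(uint32_t));
        uint32_t h1mask = ((uint32_t)1 << h1) - 1;

        radix_pass(keys, tmp, n, total_bits - h1, h1);
        for (size_t start = 0; start < n; ) {        /* subcluster every pass-1 cluster;
                                                        real code would reuse the pass-1
                                                        histogram for the boundaries */
            uint32_t c = (tmp[start] >> (total_bits - h1)) & h1mask;
            size_t end = start;
            while (end < n && ((tmp[end] >> (total_bits - h1)) & h1mask) == c)
                end++;
            radix_pass(tmp + start, keys + start, end - start,
                       total_bits - h1 - h2, h2);
            start = end;
        }
        free(tmp);
    }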
@@ -138,11 +138,11 @@
 This makes radix-cluster even more beneficial on those platforms.
 
 @+ The CPU Story
[EMAIL PROTECTED]
-Modern CPUs are called {\em super-scalar}, by which is meant that the CPU
+Modern CPUs are called @emph{super-scalar}, by which is meant that the CPU
 has two mechanisms for parallel processing:
-\begin{enumerate}
-\item CPU instruction execution is chopped into as many as 10-25 different
+@enumerate
+@item
+CPU instruction execution is chopped into as many as 10-25 different
 stages, which can be executed one after the other by different pieces of 
hardware.
 These pieces of hardware form a pipeline, so each cycle a new instruction can 
enter
 the pipeline, while at the other end one leaves (or "graduates").
@@ -150,7 +150,8 @@
 hence the quicker the hardware can execute a stage, and hence the higher the overall
 clock speed of the CPU can be. The search for ever higher CPU clock speeds thus
 explains the trend of ever longer pipelines found in modern CPUs.
-\item multiple independent pipelines may be implemented, meaning that
+@item
+multiple independent pipelines may be implemented, meaning that
 two CPU instructions that are independent can be pushed each cycle into two
 different hardware pipelines for execution. Modern CPUs have at least
 2 and possibly up to 9 replicated pipelines (often separated in integer and
@@ -160,7 +161,7 @@
 a consequence these circuits are used to create replicated execution units
 organized in pipelines whose parallel activity is supposed to increase the
 performance of the CPU.
-\end{enumerate}
+@end enumerate
 
 All this complexity comes at a price though, which is performance 
vulnerability.
 Application code must at all times have three totally independent instructions
@@ -174,21 +175,21 @@
 it to be fully executed, we have to push a next instruction into the pipeline.
 
 This turns nasty with if-then-else code like:
-\begin{verbatim}
+@verbatim
 if (A)
 then B
 else C
-\end{verbatim}
+@end verbatim
 The basic problem is that just after entering "if A" in the pipeline at stage 
1,
 the CPU does not yet know whether this instruction will evaluate to true or 
false,
 hence it does not know whether the next instruction will be B or C. Modern CPUs
-resort in this situation to {\em speculative execution}, by e.g. putting B in 
the
+resort in this situation to @emph{speculative execution}, by e.g. putting B in 
the
 pipeline just because taking the then-branch is the default (a poor estimator),
 or because in a high percentage of the previous executions of this piece of code,
 "if A" turned out to evaluate to true (which is a better estimator).
 
 Clearly, the CPU will turn out to guess wrong in a certain percentage of cases
-(called the {\em mis-prediction rate}). Mis-predicting execution has 
performance
+(called the @emph{mis-prediction rate}). Mis-predicting execution has 
performance
 consequences, as the real outcome of "if A" only comes to light when the 
instruction
 is already deep into the pipeline, and many instructions have already been 
inserted
 after it. That work has to be thrown away. Suppose now C would have been the 
correct
@@ -196,7 +197,7 @@
 is then, needs to be flushed. Also, corrective action needs to be taken in order
 to e.g. undo all effects of executing B and all other flushed instructions (e.g. by
 restoring CPU flags and registers) and we need to start over with C in stage 
1. Notice
-that a mis-prediction rate as low as 5\% on a 20-stage pipeline will typically 
cause 50%
+that a mis-prediction rate as low as 5% on a 20-stage pipeline will typically 
cause 50%
 of the pipeline to be thrown away, which already decreases performance below 
the level
 where a 20-stage pipeline at that speed is actually useful (i.e. some code 
would do better
 at a lower speed with a shorter pipeline).
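For a feel of where such hard-to-predict branches come from in database code, here is a generic C sketch (an editorial illustration, not code from this module): a selection loop whose branch is essentially unpredictable at 50% selectivity, next to a branch-free rewrite that trades the possible misprediction for a little unconditional work.

    #include <stddef.h>

    /* branchy: the if depends on the data, so at ~50% selectivity the CPU
       mispredicts roughly every other tuple */
    size_t select_branch(const int *col, size_t n, int v, size_t *out)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++)
            if (col[i] < v)
                out[k++] = i;
        return k;
    }

    /* branch-free: always write, advance the output cursor only when the
       predicate holds; no data-dependent branch left to mispredict */
    size_t select_nobranch(const int *col, size_t n, int v, size_t *out)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++) {
            out[k] = i;
            k += (col[i] < v);
        }
        return k;
    }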
@@ -211,7 +212,6 @@
 minute, that we solved our first problem, memory access).
 
 @- The CPU optimization problem
[EMAIL PROTECTED]
 Many independent studies show that CPU resource usage during most DBMS loads is awful,
 plagued by low prediction rates (and a high number of cache misses). This indicates that
 typical DBMS software is by nature full of if-statements and branches,
@@ -223,27 +223,28 @@
 be done in DBMS code to make it CPU-wise efficient?
 
 In Monet, we apply two techniques:
-\begin{itemize}
-\item macro-driven (explicit) loop unrolling. This is often dubbed 
code-expansion.
+@itemize
+@item
+macro-driven (explicit) loop unrolling. This is often dubbed code-expansion.
 Loop unrolling is a well-known technique to improve the pipelined performance
 of code that processes a bulk data structure. Regrettably, compilers can only
 detect opportunity for loop unrolling when the bounds of the bulk structure 
(array)
 are known. The sizes of arrays that store database tables are not known
 at compile time, hence the compiler needs to be helped a bit.
-\item factoring-out function calls. Function calls are an important source of
+@item
+factoring-out function calls. Function calls are an important source of
 dependence among subsequent instructions. In a language like C, a function call
 may modify any reachable memory location, hence the compiler must generate 
code to
 reload many values that are cached in registers. On top of that, executing a 
function
 call carries substantial stack management overhead (e.g. 20 cycles) and 
decreases
 the prediction-rate of the CPU.
-\end{itemize}
+@end itemize
 
 We provide our radix-algorithms in versions that can be run both with and without
 these optimization techniques enabled, in order to monitor
 their effectiveness.
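The following C sketch illustrates both points generically (invented names, not the module's actual macros): a loop that pays an indirect function call per tuple, next to a macro-expanded, type-specialized loop in the spirit of Monet's code expansion, whose body the compiler can keep in registers and unroll.

    #include <stddef.h>

    typedef int (*cmp_fn)(const void *, const void *);

    /* generic: one indirect call per tuple; the compiler must assume the call
       can touch any memory, so little stays cached in registers */
    size_t count_lt_generic(const void *col, size_t n, size_t width,
                            const void *v, cmp_fn cmp)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++)
            k += cmp((const char *)col + i * width, v) < 0;
        return k;
    }

    /* expanded: the macro stamps out one tight, call-free loop per type */
    #define COUNT_LT(TYPE)                                          \
        size_t count_lt_##TYPE(const TYPE *col, size_t n, TYPE v)   \
        {                                                           \
            size_t k = 0;                                           \
            for (size_t i = 0; i < n; i++)                          \
                k += col[i] < v;                                    \
            return k;                                               \
        }

    COUNT_LT(int)
    COUNT_LT(double)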
 
 @* Join Processing Optimized for Memory/CPU cost
[EMAIL PROTECTED]
 We now address the issue of optimizing generic join processing for optimal 
usage of
 CPU resources and memory hardware on super-scalar CPUs featuring long 
pipelines and
 out-of-order speculative execution and memory subsystems that consist of deep 
hierarchies
@@ -252,18 +253,17 @@
 We specifically want to compare the effectiveness of the 'Monet-approach' with 
a standard
 'relational approach'. We consider the generic join query:
 
-\begin{verbatim}
+@verbatim
 SELECT smaller.a1, .., smaller.aY, larger.b1, .., larger.bZ
 FROM   larger, smaller
 WHERE  larger.key = smaller.key
-\end{verbatim}
+@end verbatim
 Without loss of generality we assume that the "larger" table has at least as many
 tuples as the "smaller" table.
 
 @+ The Monet Approach
[EMAIL PROTECTED]
 In the standard approach this query would be executed in Monet with the 
following MIL statements:
-\begin{verbatim}
+@verbatim
 01
 02  # join is either positional-, merge- or hash-join.
 03  res_join := join(larger_key, smaller_key.reverse);
@@ -279,13 +279,13 @@
 B1  res_b1 := join(res_larger, larger_b1);
 BX  ....
 BZ  res_bZ := join(res_larger, larger_bZ);
-\end{verbatim}
+@end verbatim
 
 A positional join is a highly efficient kind of join found in the Monet system; it occurs
 when an OID-column is joined with a VOID column. A VOID column is a column that contains
-a sequence of densely ascending OIDs: 1@0, 2@0, 3@0, ..., N@0. In its implementation,
+a sequence of densely ascending OIDs: 1@@0, 2@@0, 3@@0, ..., N@@0. In its 
implementation,
 Monet does not materialize such OID sequences, hence the type-name "void". It is easy to look up
-a value in a VOID column, as the value you look up (e.g. 3@0) already tells its position (=3).
+a value in a VOID column, as the value you look up (e.g. 3@@0) already tells 
its position (=3).
 The positional join algorithm joins an outer BAT[any,oid] with an inner BAT[void,any]
 by scanning over the outer BAT and performing positional lookup into the inner BAT.
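A hedged C sketch of the idea (simplified flat arrays instead of the BAT API; names invented for this illustration): because the void head of the inner column is the dense sequence 0..N-1, the OID found in the outer tail is directly the array position to fetch from the inner tail.

    #include <stddef.h>

    typedef unsigned int oid_t;

    /* outer_tail[i] holds an oid into the inner column; the result pairs
       position i with the looked-up inner value */
    void positional_join(const oid_t *outer_tail, size_t outer_n,
                         const int *inner_tail, int *res_tail)
    {
        for (size_t i = 0; i < outer_n; i++)
            res_tail[i] = inner_tail[outer_tail[i]];  /* one positional lookup per tuple */
    }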
 
@@ -294,31 +294,30 @@
 one of the key columns would be of type VOID). However, if the columns are in an N-M relationship,
 or if they do not form a foreign key at all, the join
 would become a merge-join or a hash-join. Merge-join is only taken if both
-smaller\_key and larger\_key are already be sorted on key (tail column). As 
this cannot
+smaller_key and larger_key are already sorted on key (tail column). As this cannot
 generally be assumed, normally a hash-join would be the implementation chosen 
by Monet.
-A hash-join performs well as long as the smaller\_key BAT plus its associated 
hash-table
+A hash-join performs well as long as the smaller_key BAT plus its associated 
hash-table
 (which adds about 8 bytes per tuple) is smaller than the memory cache.
 
 The latter phase of the query (lines A0-AY,B0-BZ) fetches column values from 
the projected columns
 using positional join.  This performs fine up until the larger table reaches a size at which one
-larger\_bX column BAT starts to exceed the size of the memory cache.
+larger_bX column BAT starts to exceed the size of the memory cache.
 
 We now turn our attention to what happens once the BATs outgrow the memory cache. First, we discuss what happens
-if the larger\_bX BATs (which for simplicity we assume all have approximately 
the same size)
-do not fit the memory cache. Then, we discuss what happens if even the 
smaller\_key BAT plus
+if the larger_bX BATs (which for simplicity we assume all have approximately 
the same size)
+do not fit the memory cache. Then, we discuss what happens if even the 
smaller_key BAT plus
 its hash-table does not fit anymore.
 
 @- The Role of Sorting in improving Memory Access
[EMAIL PROTECTED]
 If the BATs storing the columns of the "larger" table do not fit the memory 
cache anymore,
 the positional joins in the last Y statements of the MIL script will start to 
generate cache
-misses. This is caused by the fact that the OIDs in the tail of the 
res\_larger BATs are
-not sorted; hence the access to the larger\_bX column BATs is random.
+misses. This is caused by the fact that the OIDs in the tail of the res_larger 
BATs are
+not sorted; hence the access to the larger_bX column BATs is random.
 
 This problem can be solved by sorting the result of the join first on the OIDs that point
-to the "larger" table (the head column of res\_join):
+to the "larger" table (the head column of res_join):
 
-\begin{verbatim}
+@verbatim
 ..
 03  res_join := join(larger_key, smaller_key.reverse).sort;
 04  res_larger_sorted := res_join.mark(0@0).reverse;
@@ -328,12 +327,12 @@
 B1  res_b1 := join(res_larger_sorted, larger_b1);
 BX  ....
 BZ  res_bZ := join(res_larger_sorted, larger_bZ);
-\end{verbatim}
+@end verbatim
 
-As a result, the res\_larger BAT will be ordered on tail, hence the positional 
joins on the larger\_bX
-columns will cause a nice sequential access to both res\_larger (as it is 
scanned in its role as
-"outer" join operand) and larger\_bX (due to the fact that the lookup values 
from res\_larger are
-now sorted). We must, however, take into account that res\_join
+As a result, the res_larger BAT will be ordered on tail, hence the positional 
joins on the larger_bX
+columns will cause a nice sequential access to both res_larger (as it is 
scanned in its role as
+"outer" join operand) and larger_bX (due to the fact that the lookup values 
from res_larger are
+now sorted). We must, however, take into account that res_join
 may be a BAT that is itself larger than the memory cache, in which case the sorting operation
 could itself cause a great many cache misses, and therefore perform badly.  Let us therefore
 briefly discuss the memory access properties of Monet's sorting algorithms.
@@ -349,36 +348,35 @@
 therefore takes log2(ntuples) recursion levels to sort a BAT, hence its total 
memory access consists of log2(ntuples)
 sequential scans.  However, since quicksort zooms into ever smaller sub-chunks 
of the BAT, there will
 be cache re-use in the deeper recursion levels as soon as such a chunk fits 
the memory cache, which
-happens when sizeof(chunk) = sizeof(BAT)/(2\^level) <= sizeof(cache).  Hence, 
the total memory cost of
-quicksort is log2(ntuples)-log2(sizeof(cache)/sizeof(tuple)) sequential scans.
+happens when @math{sizeof(chunk) = sizeof(BAT)/(2^level) <= sizeof(cache)}.  
Hence, the total memory cost of
+quicksort is @math{log2(ntuples)-log2(sizeof(cache)/sizeof(tuple))} sequential 
scans.
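As a worked instance of this formula (the numbers are illustrative, not taken from the text): for 2^26 (= 64M) tuples of 8 bytes and a 512KB cache,

    \log_2(\mathrm{ntuples}) - \log_2\!\left(\frac{\mathrm{sizeof(cache)}}{\mathrm{sizeof(tuple)}}\right)
      = 26 - \log_2(65536) = 26 - 16 = 10,

so only about 10 of the 26 quicksort recursion levels actually scan main memory; the deeper levels run inside the cache.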
 
 In all, the Monet quicksort implementation behaves quite well, both concerning CPU efficiency and
 memory access pattern. Still, for some simple data types (in particular 
columns containing OIDs) one
-can further improve the memory access performance by using {\em radix-sort} 
instead of quicksort.
+can further improve the memory access performance by using @emph{radix-sort} 
instead of quicksort.
 
 Radix-sort is essentially a radix-cluster on all bits, hence we do:
 
-\begin{verbatim}
+@verbatim
 ..
 03 res_join := join(larger_key, 
smaller_key.reverse).reverse.radix_cluster(R1,..,Rp).reverse;
 ..
-\end{verbatim}
+@end verbatim
 
 Where p is a suitable number of passes and R=R1+..+Rp is the total number of 
"significant bits"
 (the most significant bit in a collection of integer values is the highest bit set in any of the values).
-The head column of the join(larger\_key, smaller\_key.reverse) is of type OID, 
and contains
+The head column of the join(larger_key, smaller_key.reverse) is of type OID, 
and contains
 the OIDs from the matching tuples in the larger table. Table-OIDs are automatically generated by
 the VOID columns of Monet, and therefore these integer values are from the 
range [0,...,N], where N
 is the number of tuples in the "larger" table. We call such an integer
 sub-domain a "dense" domain. As a result, the number of significant bits is 
minimal (i.e. R=log2(N),
 there are no "spoiled" values), and we do not expect skew in such a column. 
This motivates our choice
-to implement radix-cluster for the OID type by getting radix bits {\em 
without} hashing (for all other
+to implement radix-cluster for the OID type by getting radix bits 
@emph{without} hashing (for all other
 types, we hash first). Hashing is not necessary due to absence of value-skew 
on OID columns, and
 absence of hashing allows us to use radix-cluster as radix-sort.
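A small hedged C sketch of this point (an editorial illustration, distinct from the module's RADIX macros shown further below): for a dense OID domain 0..N-1 the number of significant bits is R = ceil(log2(N)), and the radix of a value is simply its top bits, with no hashing step.

    #include <stdint.h>

    static inline int significant_bits(uint32_t n)       /* R = ceil(log2(N)) */
    {
        int r = 0;
        while (r < 32 && ((uint32_t)1 << r) < n)
            r++;
        return r;
    }

    /* highest h of the r significant bits of a dense-domain oid (assumes h <= r) */
    static inline uint32_t radix_of_oid(uint32_t oid, int r, int h)
    {
        return (oid >> (r - h)) & (((uint32_t)1 << h) - 1);
    }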
 
 @- Partitioned Hash Join
[EMAIL PROTECTED]
-We now discuss the case that even the smaller\_key BAT with its hash structure 
does not fit the
+We now discuss the case that even the smaller_key BAT with its hash structure 
does not fit the
 smallest cache. What happens then in the join-phase? Since the hash-join 
algorithm exhibits a random
 access pattern, compulsory cache misses will start to appear up to the point 
that each access to
 the hash-table will be a miss. Even in the direct hashing employed by Monet, 
this amounts to at least
@@ -391,14 +389,14 @@
 
 In the situation mentioned above, performance can be improved by using 
partitioned hash-join,
 as presented earlier in this module, instead of simple hash-join. The 
partitioned hash-join uses
-radix-cluster to quickly cluster both the smaller\_key and larger\_key BATs 
into clusters that fit
+radix-cluster to quickly cluster both the smaller_key and larger_key BATs into 
clusters that fit
 the memory cache, and then repeatedly performs hash-join on the corresponding 
clusters. In this way,
 the random access is restricted to areas that fit the memory cache, hence the 
expensive cache misses
 disappear (mostly).
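To make the control flow concrete, here is a hedged C sketch of the radix-join flavour (simplified types and invented names, not the module's BAT code): both inputs are assumed to be already radix-clustered on the same H bits, with per-cluster start offsets, and the join only ever combines corresponding clusters. The phash variant would replace the two inner loops by building a small hash table on the smaller cluster and probing it with the larger one.

    #include <stddef.h>

    typedef unsigned int oid_t;
    typedef struct { oid_t head; int key; } bun_t;   /* one (oid, key) pair */

    /* lstart/sstart have (1<<hbits)+1 entries: the start offset of each cluster plus the end.
       Emits matching (larger oid, smaller oid) pairs; returns the number of matches. */
    size_t radix_join(const bun_t *larger, const size_t *lstart,
                      const bun_t *smaller, const size_t *sstart,
                      int hbits, oid_t *res_l, oid_t *res_s)
    {
        size_t nclust = (size_t)1 << hbits, matches = 0;

        for (size_t c = 0; c < nclust; c++)              /* corresponding clusters only */
            for (size_t i = lstart[c]; i < lstart[c + 1]; i++)
                for (size_t j = sstart[c]; j < sstart[c + 1]; j++)
                    if (larger[i].key == smaller[j].key) {
                        res_l[matches] = larger[i].head; /* nested loop stays cache resident */
                        res_s[matches] = smaller[j].head;
                        matches++;
                    }
        return matches;
    }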
 
-This is realized in Monet by using radix-clustering both relations on H bits, 
e.g. larger\_key
-in l passes, and smaller\_key in s passes, such that H = L1+..+Ll = S1+..+Ss:
-\begin{verbatim}
+This is realized in Monet by radix-clustering both relations on H bits, e.g. larger_key
+in l passes, and smaller_key in s passes, such that H = L1+..+Ll = S1+..+Ss:
+@verbatim
 00  # first radix cluster both key columns on H bits (maybe different number 
of passes and bit settings)
 01  cluster_larger := radix_cluster(larger_key, L1,..,Ll);
 02  cluster_smaller := radix_cluster(smaller_key, S1,..,Ss);
@@ -406,77 +404,77 @@
     # partitioned hash join on clusters of H radix-bits.
 03  res_join := phash_join(cluster_larger, cluster_smaller.reverse, H);
 ..
-\end{verbatim}
-Line 03 above uses phash\_join, but could alternatively use radix-join:
-\begin{verbatim}
+@end verbatim
+Line 03 above uses phash_join, but could alternatively use radix-join:
+@verbatim
 ..
 03  res_join := radix_join(cluster_larger, cluster_smaller.reverse, H);
 ..
-\end{verbatim}
+@end verbatim
 
 From this point on, the same code as in the previous MIL script could be 
applied to fetch the column
 values for columns a1..aY from "smaller" and b1..bZ from "larger". The 
remaining problem here is that both
-the larger\_bX *and* the smaller\_aX BATs will tend to be bigger than the 
smallest memory cache (though this
+the larger_bX *and* the smaller_aX BATs will tend to be bigger than the 
smallest memory cache (though this
 also depends on the join hit-ratio, but let us suppose it is >= 1).
 
 To improve this, we can sort on the OIDs from the "larger" table, as described in the previous section.
-This will enhance the access pattern of the subsequent positional joins to the 
larger\_bX column BATs:
+This will enhance the access pattern of the subsequent positional joins to the 
larger_bX column BATs:
 
-\begin{verbatim}
+@verbatim
 ..
     # partitioned hash join on clusters of H radix-bits, followed by 
radix-sort on head column.
 03  res_join := phash_join(cluster_larger, cluster_smaller.reverse, 
H).reverse.radix_cluster(R1,..,Rp).reverse;
 04  res_larger_sorted := res_join.mark(0@0).reverse;
 05  res_smaller := res_join.reverse.mark(0@0).reverse;
 ..
-\end{verbatim}
+@end verbatim
 
-Now, as the smaller\_aX column BATs probably are also larger than the memory 
cache, we would like to do
-the same for the "smaller" table. But, we then cannot sort res\_join twice. We 
could sort res\_smaller on tail
+Now, as the smaller_aX column BATs probably are also larger than the memory 
cache, we would like to do
+the same for the "smaller" table. But, we then cannot sort res_join twice. We 
could sort res_smaller on tail
 after line 05:
-\begin{verbatim}
+@verbatim
 ..
 05  res_smaller := res_join.reverse.mark(0@0).reverse.radix_cluster(R1,..,Rp);
 ..
-\end{verbatim}
+@end verbatim
 However, this approach only transfers the problem to later phases of query processing.
-The positional joins would run fine, but as a tail-sorted res\_smaller would 
be a BAT[oid,oid] (i.e.
-it would no longer have a VOID head column), the result of the positional 
joins with the smaller\_aX
+The positional joins would run fine, but as a tail-sorted res_smaller would be 
a BAT[oid,oid] (i.e.
+it would no longer have a VOID head column), the result of the positional 
joins with the smaller_aX
 BAT[void,T]s would be of the form BAT[oid,T]. These results would not only 
take more space than the
 desired form BAT[void,T], but would also create a problem in further use of 
the query result, as these
-res\_aX BATs will not be sorted on head. Join access to them would go to the 
hash-join rather than the
+res_aX BATs will not be sorted on head. Join access to them would go to the 
hash-join rather than the
 positional join, and due to the random access this would pose a memory caching 
problem as these
-res\_aX BATs tend to be larger than the memory cache.
+res_aX BATs tend to be larger than the memory cache.
 
 Therefore, for these projection joins, we propose the use of a new 
memory-conscious join algorithm that is
-called {\em clustered positional join} which implements a join(BAT[void,oid] 
L, BAT[void,T] R) : BAT[void,T]
+called @emph{clustered positional join}, which implements a join(BAT[void,oid] L, BAT[void,T] R) : BAT[void,T].
 
 This algorithm consists of three phases, of which we already know the first 
two:
-\begin{description}
-\item[partial radix-cluster]
-First, the res\_smaller is {\em partially} radix-clustered on tail-OID. That 
is, the relation L (= res\_smaller in
-the positional joins to the column BATs smaller\_aX) is clustered on some 
number of highest significant
+@table @code
+@item [partial radix-cluster]
+First, the res_smaller is @emph{partially} radix-clustered on tail-OID. That 
is, the relation L (= res_smaller in
+the positional joins to the column BATs smaller_aX) is clustered on some 
number of highest significant
 radix-bits, but not on all radix-bits. Because the radix-cluster on OIDs does 
not use a hash-function,
 clustering an OID column on all significant bits radix-sorts it, or - as in 
this case - on a subset of the
 highest significant bits, partially orders it. The partial ordering of OIDs in 
chunks is done in such a way that the
 size of the corresponding chunk in R (remember R is a BAT[void,T] and has all 
OIDs in a densely ascending sequence) fits
 the memory cache.
-\item[positional join]
+@item [positional join]
 The purpose of radix-clustering L in the previous phase is to accelerate the positional join between L and R
-(i.e. res\_smaller and smaller\_aX).  Because the OIDs in the tail of L are 
now partially sorted, each chunk in L
+(i.e. res_smaller and smaller_aX).  Because the OIDs in the tail of L are now 
partially sorted, each chunk in L
 will only randomly access data from one chunk in R.  Therefore, during 
positional join, these chunks in R stay
 memory resident, accelerating performance with respect to doing the same with a non-clustered L (where each access to
 R would be a cache miss).
-\item[radix decluster]
+@item [radix decluster]
 The result of the positional join is a BAT[oid,T]. Still, we know that the 
head column, when sorted, would
 form a void column. What is now the fastest way to sort it and convert it back to a void BAT? One special property
 of the radix-cluster algorithm is that when we cluster on tail column, each 
result chunk will have the head-values
-in order. In this case, the clustered version of L (res\_smaller\_clustered, 
see below) has the head OIDs in order {\em within the
+in order. In this case, the clustered version of L (res_smaller_clustered, see 
below) has the head OIDs in order @emph{within the
 chunk}. This sub-ordering is also carried over by the positional join to its result: in each
 'virtual chunk' in the BAT[oid,T], the OIDs appear in order. Therefore, we can perform a merge operation to merge
 all BAT[oid,T] chunks into a BAT[void,T] result. Normally, the cost of a merge 
is at least O(log(P)*N), where N
 is the total number of tuples, and P is the number of chunks. By using the 
special property that eventually,
-the merged OIDs form a densely ascending sequence (0@0, 1@0,..,N@0), we can bring this cost back to O(N)! This {\em radix decluster}
+the merged OIDs form a densely ascending sequence (0@@0, 1@@0,..,N@@0), we can 
bring this cost back to O(N)! This @emph{radix decluster}
 algorithm keeps a window of OIDs [windowStart, windowStart+1, ..., windowStart+windowSize-1] open during the
 merge. Each iteration of the algorithm finds all the next windowSize OIDs in the chunks and inserts them
 in the result BAT[void,T]. This is done by going through all (not yet empty) chunks and inserting from the top of
@@ -487,10 +485,10 @@
 in the chunk, the chunk cache lines will be re-used and performance will be 
good. The only restriction on the windowSize
 is that the insertion window on the output BAT must fit the memory cache. This limit will only be exceeded at very
 large table sizes (a possible remedy is then to perform the merge in multiple passes).
-\end{description}
+@end table
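A hedged C sketch of the windowed merge described under [radix decluster] above (simplified chunk layout and invented names, not the module's radix_decluster): every chunk is locally sorted on OID and the union of all OIDs is the dense range 0..N-1, so draining each chunk up to the current window end places every value at out[oid] while the writes stay inside a cache-sized window.

    #include <stdlib.h>

    typedef unsigned int oid_t;

    void decluster_merge(size_t P, const oid_t *const *chunk_oid,
                         const int *const *chunk_val, const size_t *chunk_len,
                         size_t N, size_t W, int *out)
    {
        size_t *pos = calloc(P, sizeof(size_t));     /* read cursor per chunk */

        for (size_t lo = 0; lo < N; lo += W) {
            oid_t hi = (oid_t)(lo + W < N ? lo + W : N);
            for (size_t c = 0; c < P; c++)           /* drain every chunk up to the window end */
                while (pos[c] < chunk_len[c] && chunk_oid[c][pos[c]] < hi) {
                    out[chunk_oid[c][pos[c]]] = chunk_val[c][pos[c]];
                    pos[c]++;
                }
        }
        free(pos);
    }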
 
 In all, in Monet this join strategy is expressed in the following MIL:
-\begin{verbatim}
+@verbatim
 ..
     # subcluster on Rs significant radix-bits, ignoring lowest Ri bits
 05  res_smaller_clustered := res_join.reverse.mark(0@0).reverse.radix_cluster(-Ri, Rs);
@@ -501,7 +499,7 @@
 AX   ....
 AY  res_aY := join(res_smaller_clustered, 
smaller_aY).radix_decluster(borders_smaller);
 ..
-\end{verbatim}
+@end verbatim
 
 Possibly, one could consider using this approach as well for the projections on the larger table
 (instead of the sort). However, sorting once (with radix-sort by 
radix-clustering on all significant
@@ -514,7 +512,7 @@
 
 The full MIL script, when the individual BATs of both the "smaller" and "larger" tables as well as
 the (N-M) join result exceed the memory cache, becomes:
-\begin{verbatim}
+@verbatim
     # first radix cluster both key columns on H bits (maybe different number 
of passes and bit settings)
 01  cluster_larger := radix_cluster(larger_key, L1,..,Ll);
 02  cluster_smaller := radix_cluster(smaller_key, S1,..,Ss);
@@ -536,10 +534,9 @@
 B1  res_b1 := join(res_larger_sorted, larger_b1);
 BX  ....
 BZ  res_bZ := join(res_larger_sorted, larger_bZ);
-\end{verbatim}
+@end verbatim
 
 @+ The Relational Approach
[EMAIL PROTECTED]
 A cache-conscious join in a relational DBMS would first radix-cluster both the 
smaller and larger table,
 where in the process it would project on just the selected columns. As the 
relational model does not
 separate its algebraic actions by column, as Monet in MIL does, it cannot use 
the technique of type-expansion
@@ -561,7 +558,7 @@
 all other columns.
 
 The entire MIL sequence is then the following 5 statements:
-\begin{verbatim}
+@verbatim
 01  smaller_all := [integer]([integer](smaller_key,1).reverse, 128).reverse;
 02  larger_all := [integer]([integer](larger_key,1).reverse, 128).reverse;
 
@@ -571,10 +568,10 @@
 05  cluster_smaller := radix_cluster(smaller_view, S1,..,Ss);
 06  cluster_larger := radix_cluster(larger_view, L1,..,Ll);
 07  res_join := phash_join(cluster_larger, cluster_smaller.reverse, H);
-\end{verbatim}
+@end verbatim
 
-Notice that the fact that smaller\_view and larger\_view are MIL views on the 
base BATs
-smaller\_all and larger\_all, means that the projection is never materialized. 
Projecting
+Notice that, because smaller_view and larger_view are MIL views on the base BATs
+smaller_all and larger_all, the projection is never materialized. Projecting
 is done on the fly during the first pass of radix cluster, just like what 
would happen in
 a relational system. What is more, the copying of each integer8 (view) value 
from its
 storage as integer128 is done with 8 memcpy() calls that fetch values at 
regular intervals
@@ -590,9 +587,9 @@
 
 Therefore, it might even be that the increased cost of radix-cluster will make 
simple
 hash-join faster than partitioned hash-join:
-\begin{verbatim}
+@verbatim
 05  res_join := join(larger_view, smaller_view.reverse);
-\end{verbatim}
+@end verbatim
 Either way, we expect relational performance to be a factor of 10 slower than Monet
 on big sizes of both the "smaller" and "larger" tables, with hit rates such that the result table is big.
 
@@ -1714,7 +1711,6 @@
 
 
 @+ Radix Cluster
[EMAIL PROTECTED]
 In radix cluster we want to deliver one new BAT that consists
 of a consecutive memory area (like all BATs do) with the tuples
 clustered on a certain radix. To do this correctly in one scan
@@ -1736,25 +1732,30 @@
 TODO (easy): make N tunable
 
 functions:
-\begin{itemize}
-\item radix\_buns()    does the basic clustering stuff (95% of effort)
-\item cnt\_buns()      is the histogram scan that is triggered when we need
-                       to move data
-\item move\_buns()     does the moving work. This is tricky; as copy
-                       dependencies bind you to a certain schedule.
-\item radix\_chunk()   is the main routine that does one radix scan and
-                       produces a new (chunk of a) bat. Does so by first going
-                       for uniform distributions and executing radix\_buns().
-                       If a buffer overflows, this stops and cnt\_buns() is
-                       done. The buffers are then moved into correct position
-                       by move\_buns(). Clustering is then finished by again
-                       radix\_buns().
-\item radix\_cluster() is the multilevel radix cluster routine. On the first
-                       level; it processes 1 chunk and produced N1 new ones.
-                       On the second level, N1 chunks are processed and divided
-                       each in N2 new ones (making for N1*N2 clusters), etc.
-                       For clustering each chunk, it calls radix\_chunk().
-\end{itemize}
+@itemize
+@item radix_buns()
+does the basic clustering stuff (95% of the effort)
+@item cnt_buns()
+is the histogram scan that is triggered when we need
+to move data
+@item move_buns()
+does the moving work. This is tricky, as copy
+dependencies bind you to a certain schedule.
+@item radix_chunk()
+is the main routine that does one radix scan and
+produces a new (chunk of a) bat. It does so by first assuming
+a uniform distribution and executing radix_buns().
+If a buffer overflows, this stops and cnt_buns() is
+done. The buffers are then moved into the correct position
+by move_buns(). Clustering is then finished by again
+running radix_buns().
+@item radix_cluster()
+is the multilevel radix cluster routine. On the first
+level, it processes 1 chunk and produces N1 new ones.
+On the second level, N1 chunks are processed and divided
+each into N2 new ones (making for N1*N2 clusters), etc.
+For clustering each chunk, it calls radix_chunk().
+@end itemize
 @c
 #define any_RADIX(p,rd) ((hash_t)((*BATatoms[any].atomHash)(p) & rd))
 #define oid_RADIX(p,rd) ((hash_t) *(oid*) (p) & rd)
@@ -2997,7 +2998,7 @@
 its non-sorted OID head column into a sorted and densely ascending void 
column. The tail
 itself does not need to contain the partially sorted values; in fact we do not 
even look
 at those values but directly take the cluster boundaries as an input 
parameter. This input
-parameter is a radix\_count BAT that must have been created with the 
radix\_count command.
+parameter is a radix_count BAT that must have been created with the 
radix_count command.
 
 @= radix_decluster
 {
@@ -3175,13 +3176,12 @@
 }
 
 @- radix_decluster
[EMAIL PROTECTED]
 intends to improve performance by using a cheaper positional join:
        join(bat[void,oid], bat[void,T])
 instead of
        join(bat[oid,oid], bat[void,T])
 
-price paid are two parameters to radix\_decluster.
+the price paid is two parameters to radix_decluster.
 
 @= radix_decluster2
 {

Index: batxml.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB5/src/modules/mal/batxml.mx,v
retrieving revision 1.23
retrieving revision 1.24
diff -u -d -r1.23 -r1.24
--- batxml.mx   11 Jan 2008 10:41:37 -0000      1.23
+++ batxml.mx   8 Feb 2008 22:36:34 -0000       1.24
@@ -797,6 +797,7 @@
        BBPunfix(e->batCacheid);
        BBPunfix(g->batCacheid);
        BBPunfix(b->batCacheid);
+       bn->tsorted = 0;
        BBPkeepref(*ret=bn->batCacheid);
        return MAL_SUCCEED;
 bunins_failed:
@@ -867,6 +868,7 @@
        BBPunfix(r->batCacheid);
        BBPunfix(j->batCacheid);
        GDKfree(buf);
+       bn->tsorted = 0;
        BBPkeepref(*ret=bn->batCacheid);
        return MAL_SUCCEED;
 bunins_failed:

