Update of /cvsroot/monetdb/MonetDB5/src/modules/mal
In directory sc8-pr-cvs16.sourceforge.net:/tmp/cvs-serv15857/src/modules/mal
Modified Files:
batxml.mx radix.mx
Log Message:
propagated changes of Sunday Feb 03 2008 - Friday Feb 08 2008
from the MonetDB_5-4 branch to the development trunk
Index: radix.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB5/src/modules/mal/radix.mx,v
retrieving revision 1.15
retrieving revision 1.16
diff -u -d -r1.15 -r1.16
--- radix.mx 11 Jan 2008 10:41:38 -0000 1.15
+++ radix.mx 8 Feb 2008 22:36:34 -0000 1.16
@@ -28,7 +28,6 @@
are applied in Monet's join processing strategy.
@+ The Memory Story
[EMAIL PROTECTED]
Computer RAM carries the acronym Random Access Memory, indicating that
the memory access speed is independent of memory location. While this
is still (mostly) true, the imbalance in speed improvements between CPU
@@ -43,7 +42,7 @@
and an L2 (level-two) cache (typical size 256KB-2MB, typical latency 10-30ns).
The L1
and sometimes even the L2 are now located on the CPU chip itself in order to
reduce latency.
-Cache memories are organized in {\em cache lines} of a fixed width. A typical
width for
+Cache memories are organized in @emph{cache lines} of a fixed width. A typical
width for
an L1 line is 32 bytes, whereas an L2 line can be 32-128 bytes long. A cache
line
is the smallest unit of transfer, which means that on a miss, the memory
system fetches
all bytes of the line from the lower levels of the memory hierarchy in one go.
@@ -52,14 +51,14 @@
as the different bytes of a cache line arrive in a sequential burst, but the
total difference
in arrival-time tends to be no more than 4 cycles from first to last
byte. This design of
the memory caches (on all levels, but with varying size and line width
parameters) has consequences
-for application performance, whose severity depend on what kind of {\em memory
access patterns}
+for application performance, whose severity depends on what kind of
@emph{memory access patterns}
the application exhibits. Reading (or writing) a memory location that is not
in the cache causes
a miss, and the CPU is forced to wait out the latency period. It is obvious
that subsequent
reading of this exact same data will not cause further cache misses, and will
be fast -- because
this data is already in cache. But what happens with a sequential access
pattern to data that
is not in the cache? Again, the first read causes a cache miss. However,
subsequent reads
to adjacent bytes access the same cache line that is already loaded and
therefore
-do {\em not} cause any cache misses. This is the reason why sequential memory
access
+do @emph{not} cause any cache misses. This is the reason why sequential memory
access
is cheaper than random access, as the latter pattern may cause a cache miss on
every
read (considering the access is to an uncached memory region).
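To make the cost difference concrete, here is a small editorial sketch in C
(not part of the module; the array size and the pseudo-random index formula
are arbitrary assumptions). It touches the same array once sequentially and
once in a pseudo-random order; once the array is much larger than the caches,
the random pass is typically several times slower, because almost every
access misses a cache line.
@verbatim
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)    /* 64M ints = 256MB, well beyond any cache */

int main(void)
{
    int *a = malloc((size_t)N * sizeof(int));
    long long sum = 0;
    unsigned long j = 0;
    clock_t t0;
    long i;

    for (i = 0; i < N; i++)
        a[i] = (int)i;

    t0 = clock();               /* sequential: each cache line used fully */
    for (i = 0; i < N; i++)
        sum += a[i];
    printf("sequential: %.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();               /* random: almost every access is a miss */
    for (i = 0; i < N; i++) {
        j = (j * 1103515245UL + 12345UL) % N;
        sum += a[j];
    }
    printf("random    : %.2fs (sum=%lld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, sum);

    free(a);
    return 0;
}
@end verbatim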
@@ -73,7 +72,6 @@
their time on memory cache misses.
@- Optimizing Memory Access during Join
[EMAIL PROTECTED]
We focus here on improving the memory access pattern of the join
operator in order to gain performance. This is relevant, because the
most popular main-memory join algorithm is hash-join, which exhibits a
@@ -94,16 +92,18 @@
the access pattern of the join operation, in such a way that the overall
number of cache misses is reduced. This is typically achieved by doing
extra CPU work and/or by making additional sequential passes over the
-memory. In particular, the {\em partitioned-join} strategy first partitions
+memory. In particular, the @emph{partitioned-join} strategy first partitions
(or clusters) both inner and outer relations into small clusters, such that
each cluster fits the smallest memory cache. The partitioned-join then only
has to combine tuples from corresponding clusters. This module provides two
partitioned-join algorithms:
-\begin{itemize}
-\item the {\em phash join} algorithm that performs hash-join on the matching
clusters.
-\item as an alternative, we provide {\em radix join} that performs nested loop
+@itemize
+@item
+the @emph{phash join} algorithm that performs hash-join on the matching
clusters.
+@item
+as an alternative, we provide @emph{radix join} that performs nested loop
join on the matching clusters.
-\end{itemize}
+@end itemize
Phash join needs clusters such that each cluster of the inner relation plus its
hash table fits the smallest cache. If we assume that tuples in the inner relation
(plus hash table) occupy 16 bytes and the L1 cache is 16KB, we need clusters
@@ -120,10 +120,10 @@
the cluster operation itself will start to generate a huge number of cache
misses.
This reduces the efficiency of partitioned join.
-The {\em radix-cluster} algorithm proposed
+The @emph{radix-cluster} algorithm proposed
here solves this problem by making multiple passes; in each pass the relation
is
subclustered on a number of radix-bits. Radix-bits are a subsequence of bits
taken
-from a {\em radix-number}, which usually is the integer result of hashing the
value on which
+from a @emph{radix-number}, which usually is the integer result of hashing the
value on which
we want to cluster. The first pass of the radix-cluster algorithm puts all
tuples with an
equal bit-pattern in the higher H1 radix-bits together in a cluster. The
second pass starts
where the previous left off, and subdivides each cluster on the second-highest
H2 bits.
@@ -138,11 +138,11 @@
This makes radix-cluster even more beneficial on those platforms.
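As an editorial sketch of the idea (simplified, with made-up names; this is
not the module's radix_cluster code, which works on BATs, supports hashing
and any number of passes), the following C fragment clusters an array of keys
that have R significant bits: a first pass clusters on the highest D1 bits, a
second pass refines each resulting cluster on the next D2 bits, and each pass
is a histogram scan followed by a scatter scan.
@verbatim
#include <stdlib.h>

/* One clustering pass: reorder src[0..n) into dst[0..n) on the 'bits'
 * radix bits found at bit position 'shift' of each key.  cnt[] must have
 * (1<<bits)+1 entries; after the call, cnt[c] is the end offset of
 * cluster c in dst. */
static void
radix_pass(const unsigned *src, unsigned *dst, size_t n,
           int shift, int bits, size_t *cnt)
{
    size_t nclust = (size_t)1 << bits, i, sum = 0;
    unsigned mask = (unsigned)nclust - 1;

    for (i = 0; i <= nclust; i++)
        cnt[i] = 0;
    for (i = 0; i < n; i++)                     /* histogram scan */
        cnt[(src[i] >> shift) & mask]++;
    for (i = 0; i <= nclust; i++) {             /* exclusive prefix sum */
        size_t c = cnt[i];
        cnt[i] = sum;
        sum += c;
    }
    for (i = 0; i < n; i++)                     /* scatter scan */
        dst[cnt[(src[i] >> shift) & mask]++] = src[i];
}

/* Two-pass radix-cluster on the R = D1 + D2 most significant bits. */
static void
radix_cluster_2pass(unsigned *keys, unsigned *tmp, size_t n,
                    int R, int D1, int D2)
{
    size_t c1 = (size_t)1 << D1, i;
    size_t *b1 = malloc((c1 + 1) * sizeof(*b1));
    size_t *b2 = malloc((((size_t)1 << D2) + 1) * sizeof(*b2));

    radix_pass(keys, tmp, n, R - D1, D1, b1);       /* pass 1: keys -> tmp */
    for (i = 0; i < c1; i++) {                      /* pass 2, per cluster */
        size_t lo = i ? b1[i - 1] : 0, hi = b1[i];
        radix_pass(tmp + lo, keys + lo, hi - lo, R - D1 - D2, D2, b2);
    }
    free(b1);
    free(b2);
}
@end verbatim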
@+ The CPU Story
[EMAIL PROTECTED]
-Modern CPUs are called {\em super-scalar}, by which is meant that the CPU
+Modern CPUs are called @emph{super-scalar}, by which is meant that the CPU
has two mechanisms for parallel processing:
-\begin{enumerate}
-\item CPU instruction execution is chopped into as many as 10-25 different
+@enumerate
+@item
+CPU instruction execution is chopped into as many as 10-25 different
stages, which can be executed one after the other by different pieces of
hardware.
These pieces of hardware form a pipeline, so each cycle a new instruction can
enter
the pipeline, while at the other end one leaves (or "graduates").
@@ -150,7 +150,8 @@
hence the quicker the hardware can execute a stage, hence the higher the
overall
clock speed of the CPU can be. The search for ever higher CPU clock speeds thus
explains the trend of ever longer pipelines found in modern CPUs.
-\item multiple independent pipelines may be implemented, meaning that
+@item
+multiple independent pipelines may be implemented, meaning that
two CPU instructions that are independent can be pushed each cycle into two
different hardware pipelines for execution. Modern CPUs have at least
2 and possibly up to 9 replicated pipelines (often separated in integer and
@@ -160,7 +161,7 @@
a consequence these circuits are used to create replicated execution units
organized in pipelines whose parallel activity is supposed to increase the
performance of the CPU.
-\end{enumerate}
+@end enumerate
All this complexity comes at a price though, which is performance
vulnerability.
Application code must at all times have three totally independent instructions
@@ -174,21 +175,21 @@
it to be fully executed, we have to push a next instruction into the pipeline.
This turns nasty on if-then-else code like:
-\begin{verbatim}
+@verbatim
if (A)
then B
else C
-\end{verbatim}
+@end verbatim
The basic problem is that just after entering "if A" in the pipeline at stage
1,
the CPU does not yet know whether this instruction will evaluate to true or
false,
hence it does not know whether the next instruction will be B or C. Modern CPUs
-resort in this situation to {\em speculative execution}, by e.g. putting B in
the
+resort in this situation to @emph{speculative execution}, by e.g. putting B in
the
pipeline just because taking the then-branch is the default (a poor estimator),
or because in a high percentage of the previous executions of this piece of code
"if A" turned out to evaluate to true (which is a better estimator).
Clearly, the CPU will turn out to guess wrong in a certain percentage of cases
-(called the {\em mis-prediction rate}). Mis-predicting execution has
performance
+(called the @emph{mis-prediction rate}). Mis-predicting execution has
performance
consequences, as the real outcome of "if A" only comes to light when the
instruction
is already deep into the pipeline, and many instructions have already been
inserted
after it. That work has to be thrown away. Suppose now C would have been the
correct
@@ -196,7 +197,7 @@
is then, needs to be flushed. Also, corrective action needs to be taken
in order
to e.g. undo all effects of executing B and all other flushed instructions
(e.g. by
restoring CPU flags and registers) and we need to start over with C in stage
1. Notice
-that a mis-prediction rate as low as 5\% on a 20-stage pipeline will typically
cause 50%
+that a mis-prediction rate as low as 5% on a 20-stage pipeline will typically
cause 50%
of the pipeline to be thrown away, which already decreases performance below
the level
where a 20-stage pipeline at that speed is actually useful (i.e. some code
would do better
at a lower speed with a shorter pipeline).
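To see the effect in isolation, consider the following editorial sketch (a
generic illustration, not code from this module; the array size and threshold
are arbitrary): counting values below a threshold with an if-statement suffers
roughly 50% mis-predictions on random data, while the branch-free variant that
adds the comparison outcome directly gives the pipeline nothing to guess about.
@verbatim
#include <stdio.h>
#include <stdlib.h>

#define N (16 * 1024 * 1024)

int main(void)
{
    int *v = malloc((size_t)N * sizeof(int));
    size_t i, n_branch = 0, n_nobranch = 0;

    srand(42);
    for (i = 0; i < N; i++)
        v[i] = rand() & 255;            /* unpredictable values in [0,255] */

    /* branching version: the outcome of (v[i] < 128) is a coin flip,
     * so the branch predictor is wrong about half the time */
    for (i = 0; i < N; i++)
        if (v[i] < 128)
            n_branch++;

    /* branch-free (predicated) version: the comparison result is used
     * as an arithmetic value 0 or 1, so there is no branch to predict */
    for (i = 0; i < N; i++)
        n_nobranch += (size_t)(v[i] < 128);

    printf("%zu %zu\n", n_branch, n_nobranch);
    free(v);
    return 0;
}
@end verbatim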
@@ -211,7 +212,6 @@
minute, that we solved our first problem, memory access).
@- The CPU optimization problem
[EMAIL PROTECTED]
Many independent studies show that CPU resource usage during most DBMS loads
is awful,
plagued by low prediction rates (and a high number of cache misses). This
indicates that
typical DBMS software tends to be full of if-statements and branches,
@@ -223,27 +223,28 @@
be done in DBMS code to make it CPU-wise efficient?
In Monet, we apply two techniques:
-\begin{itemize}
-\item macro-driven (explicit) loop unrolling. This is often dubbed
code-expansion.
+@itemize
+@item
+macro-driven (explicit) loop unrolling, often dubbed code-expansion (a sketch
follows below).
Loop unrolling is a well-known technique to improve the pipelined performance
of code that processes a bulk data structure. Regrettably, compilers can only
detect opportunity for loop unrolling when the bounds of the bulk structure
(array)
are known. The sizes of arrays that store database tables are not known
at compile time, hence the compiler needs to be helped a bit.
-\item factoring-out function calls. Function calls are an important source of
+@item
+factoring-out function calls. Function calls are an important source of
dependence among subsequent instructions. In a language like C, a function call
may modify any reachable memory location, hence the compiler must generate
code to
reload many values that are cached in registers. On top of that, executing a
function
call carries substantial stack management overhead (e.g. 20 cycles) and
decreases
the prediction-rate of the CPU.
-\end{itemize}
+@end itemize
We provide our radix-algorithms in versions that can be run both with and
without these optimization techniques enabled, in order to monitor
their effectiveness.
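As a sketch of the macro-driven unrolling mentioned above (the macro name, the
unroll factor of four and the example loop are made up for illustration; this
is not the expansion machinery actually used in Monet), a macro can replicate
the loop body a fixed number of times, so that several independent
instructions are available per cycle even though the array bound is only
known at run time:
@verbatim
#include <stddef.h>

/* Expand BODY four times per iteration; I is the index variable, N the
 * (run-time) bound.  The second loop handles the remaining N % 4 elements. */
#define UNROLL4(I, N, BODY)                        \
    do {                                           \
        size_t unroll_n4 = (N) & ~(size_t)3;       \
        for ((I) = 0; (I) < unroll_n4; ) {         \
            BODY; (I)++;                           \
            BODY; (I)++;                           \
            BODY; (I)++;                           \
            BODY; (I)++;                           \
        }                                          \
        for (; (I) < (N); (I)++)                   \
            BODY;                                  \
    } while (0)

/* usage: copy a column whose length is only known at run time */
static void
copy_column(int *dst, const int *src, size_t n)
{
    size_t i;

    UNROLL4(i, n, dst[i] = src[i]);
}
@end verbatim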
@* Join Processing Optimized for Memory/CPU cost
[EMAIL PROTECTED]
We now address the issue of optimizing generic join processing for optimal
usage of
CPU resources and memory hardware on super-scalar CPUs featuring long
pipelines and
out-of-order speculative execution and memory subsystems that consist of deep
hierarchies
@@ -252,18 +253,17 @@
We specifically want to compare the effectiveness of the 'Monet-approach' with
a standard
'relational approach'. We consider the generic join query:
-\begin{verbatim}
+@verbatim
SELECT larger.b1, .., larger.bZ, smaller.a1, .., smaller.aY
FROM larger, smaller
WHERE larger.key = smaller.key
-\end{verbatim}
+@end verbatim
Without loss of generality we assume that the "larger" table has at least as
many tuples as the "smaller" table.
@+ The Monet Approach
[EMAIL PROTECTED]
In the standard approach this query would be executed in Monet with the
following MIL statements:
-\begin{verbatim}
+@verbatim
01
02 # join is either positional-, merge- or hash-join.
03 res_join := join(larger_key, smaller_key.reverse);
@@ -279,13 +279,13 @@
B1 res_b1 := join(res_larger, larger_b1);
BX ....
BZ res_bZ := join(res_larger, larger_bZ);
-\end{verbatim}
+@end verbatim
A positional join is a highly efficient kind of join found in the Monet
system, which occurs
when an OID-column is joined with a VOID column. A VOID column is a column
that contains
-a sequence of densely ascending OIDs: 1@0, 2@0,
3@0, ..., N@0. In its implementation,
+a sequence of densely ascending OIDs: 1@@0, 2@@0, 3@@0, ..., N@@0. In its
implementation,
Monet does not materialize such OID sequences, hence the type-name "void". It
is easy to look up
-a value in a VOID column, as the value you look up (e.g. 3@0)
already tells its position (=3).
+a value in a VOID column, as the value you look up (e.g. 3@@0) already tells
its position (=3).
The positional join algorithm joins an outer BAT[any,oid] with an inner
BAT[void,any]
by scanning over the inner-BAT and performing positional lookup into the outer
BAT.
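For illustration, a positional lookup loop might look like the following
editorial sketch (types and names are invented; the real implementation lives
in the Monet kernel and handles the general cases): because the head of the
void-headed BAT is a dense OID sequence starting at some seqbase, the OID
itself is the array position, so each tuple costs one subtraction and one
array reference.
@verbatim
#include <stddef.h>

typedef unsigned int oid;

/* For each oid in oids[0..n), fetch the tail value at that position in a
 * void-headed column whose dense head starts at 'seqbase'. */
static void
positional_lookup(const oid *oids, size_t n,
                  const int *void_tail, oid seqbase, int *result)
{
    size_t i;

    for (i = 0; i < n; i++)
        result[i] = void_tail[oids[i] - seqbase];   /* oid == position */
}
@end verbatim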
@@ -294,31 +294,30 @@
one of the key columns would be of type VOID). However, if the columns form an
N-M relationship,
or if they do not form a foreign key at all, the join
would become a merge-join or a hash-join. Merge-join is only taken if both
-smaller\_key and larger\_key are already be sorted on key (tail column). As
this cannot
+smaller_key and larger_key are already sorted on key (tail column). As this
cannot
generally be assumed, normally a hash-join would be the implementation chosen
by Monet.
-A hash-join performs well as long as the smaller\_key BAT plus its associated
hash-table
+A hash-join performs well as long as the smaller_key BAT plus its associated
hash-table
(which adds about 8 bytes per tuple), is smaller than the memory cache.
The latter phase of the query (lines A0-AY,B0-BZ) fetches column values from
the projected columns
using positional join. This performs fine until the larger
table grows to the point where one
-larger\_bX column BAT starts to exceed the size of the memory cache.
+larger_bX column BAT starts to exceed the size of the memory cache.
We now turn our attention to what happens when these sizes are exceeded. First, we
discuss what happens
-if the larger\_bX BATs (which for simplicity we assume all have approximately
the same size)
-do not fit the memory cache. Then, we discuss what happens if even the
smaller\_key BAT plus
+if the larger_bX BATs (which for simplicity we assume all have approximately
the same size)
+do not fit the memory cache. Then, we discuss what happens if even the
smaller_key BAT plus
its hash-table does not fit anymore.
@- The Role of Sorting in improving Memory Access
[EMAIL PROTECTED]
If the BATs storing the columns of the "larger" table do not fit the memory
cache anymore,
the positional joins in the last Y statements of the MIL script will start to
generate cache
-misses. This is caused by the fact that the OIDs in the tail of the
res\_larger BATs are
-not sorted; hence the access to the larger\_bX column BATs is random.
+misses. This is caused by the fact that the OIDs in the tail of the res_larger
BATs are
+not sorted; hence the access to the larger_bX column BATs is random.
This problem can be solved by sorting the result of the join first on the
OIDs that point
-to the "larger" table (the head column of res\_join):
+to the "larger" table (the head column of res_join):
-\begin{verbatim}
+@verbatim
..
03 res_join := join(larger_key, smaller_key.reverse).sort;
04 res_larger_sorted := res_join.mark(0@0).reverse;
@@ -328,12 +327,12 @@
B1 res_b1 := join(res_larger_sorted, larger_b1);
BX ....
BZ res_bZ := join(res_larger_sorted, larger_bZ);
-\end{verbatim}
+@end verbatim
-As a result, the res\_larger BAT will be ordered on tail, hence the positional
joins on the larger\_bX
-columns will cause a nice sequential access to both res\_larger (as it is
scanned in its role as
-"outer" join operand) and larger\_bX (due to the fact that the lookup values
from res\_larger are
-now sorted). We must, however, take into account that res\_join
+As a result, the res_larger BAT will be ordered on tail, hence the positional
joins on the larger_bX
+columns will cause a nice sequential access to both res_larger (as it is
scanned in its role as
+"outer" join operand) and larger_bX (due to the fact that the lookup values
from res_larger are
+now sorted). We must, however, take into account that res_join
may be a BAT that itself is larger than the memory cache, in which case the
sorting operation
itself could cause a great many cache misses, and therefore perform
badly. Let us therefore
briefly discuss the memory access properties of Monet's sorting algorithms.
@@ -349,36 +348,35 @@
therefore takes log2(ntuples) recursion levels to sort a BAT, hence its total
memory access consists of log2(ntuples)
sequential scans. However, since quicksort zooms into ever smaller sub-chunks
of the BAT, there will
be cache re-use in the deeper recursion levels as soon as such a chunk fits
the memory cache, which
-happens when sizeof(chunk) = sizeof(BAT)/(2\^level) <= sizeof(cache). Hence,
the total memory cost of
-quicksort is log2(ntuples)-log2(sizeof(cache)/sizeof(tuple)) sequential scans.
+happens when @math{sizeof(chunk) = sizeof(BAT)/(2^level) <= sizeof(cache)}.
Hence, the total memory cost of
+quicksort is @math{log2(ntuples)-log2(sizeof(cache)/sizeof(tuple))} sequential
scans.
In all, the Monet quicksort implementation behaves quite well, both in terms of
CPU efficiency and
memory access pattern. Still, for some simple data types (in particular
columns containing OIDs) one
-can further improve the memory access performance by using {\em radix-sort}
instead of quicksort.
+can further improve the memory access performance by using @emph{radix-sort}
instead of quicksort.
Radix-sort is essentially a radix-cluster on all bits, hence we do:
-\begin{verbatim}
+@verbatim
..
03 res_join := join(larger_key,
smaller_key.reverse).reverse.radix_cluster(R1,..,Rp).reverse;
..
-\end{verbatim}
+@end verbatim
Where p is a suitable number of passes and R=R1+..+Rp is the total number of
"significant bits"
(the most significant bit of a collection of integer values is the highest bit
set in any of the values).
-The head column of the join(larger\_key, smaller\_key.reverse) is of type OID,
and contains
+The head column of the join(larger_key, smaller_key.reverse) is of type OID,
and contains
the OIDs from the matching tuples in the larger table. Table-OIDs are
automatically generated by
the VOID columns of Monet, and therefore these integer values are from the
range [0,...,N], where N
is the number of tuples in the "larger" table. We call such an integer
sub-domain a "dense" domain. As a result, the number of significant bits is
minimal (i.e. R=log2(N),
there are no "spoiled" values), and we do not expect skew in such a column.
This motivates our choice
-to implement radix-cluster for the OID type by getting radix bits {\em
without} hashing (for all other
+to implement radix-cluster for the OID type by getting radix bits
@emph{without} hashing (for all other
types, we hash first). Hashing is not necessary due to absence of value-skew
on OID columns, and
absence of hashing allows us to use radix-cluster as radix-sort.
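A tiny editorial sketch of what "radix bits without hashing" means for a dense
OID column (the helper and its parameters are invented): since the OIDs lie in
[0,N), all R = log2(N) significant bits can be taken straight from the value,
so clustering on all of them sorts the column; for other types the same bits
would be taken from a hash of the value instead.
@verbatim
typedef unsigned int oid;

/* Take the highest D of the R significant bits of an OID drawn from the
 * dense range [0,N), R = ceil(log2(N)); no hash function is involved. */
static unsigned
oid_radix_bits(oid o, int R, int D)
{
    return (unsigned)(o >> (R - D)) & ((1u << D) - 1u);
}
@end verbatim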
@- Partitioned Hash Join
[EMAIL PROTECTED]
-We now discuss the case that even the smaller\_key BAT with its hash structure
does not fit the
+We now discuss the case that even the smaller_key BAT with its hash structure
does not fit the
smallest cache. What happens then in the join-phase? Since the hash-join
algorithm exhibits a random
access pattern, compulsory cache misses will start to appear up to the point
that each access to
the hash-table will be a miss. Even in the direct hashing employed by Monet,
this amounts to at least
@@ -391,14 +389,14 @@
In the situation mentioned above, performance can be improved by using
partitioned hash-join,
as presented earlier in this module, instead of simple hash-join. The
partitioned hash-join uses
-radix-cluster to quickly cluster both the smaller\_key and larger\_key BATs
into clusters that fit
+radix-cluster to quickly cluster both the smaller_key and larger_key BATs into
clusters that fit
the memory cache, and then repeatedly performs hash-join on the corresponding
clusters. In this way,
the random access is restricted to areas that fit the memory cache, hence the
expensive cache misses
disappear (mostly).
-This is realized in Monet by using radix-clustering both relations on H bits,
e.g. larger\_key
-in l passes, and smaller\_key in s passes, such that H = L1+..+Ll = S1+..+Ss:
-\begin{verbatim}
+This is realized in Monet by radix-clustering both relations on H bits,
e.g. larger_key
+in l passes, and smaller_key in s passes, such that H = L1+..+Ll = S1+..+Ss:
+@verbatim
00 # first radix cluster both key columns on H bits (maybe different number
of passes and bit settings)
01 cluster_larger := radix_cluster(larger_key, L1,..,Ll);
02 cluster_smaller := radix_cluster(smaller_key, S1,..,Ss);
@@ -406,77 +404,77 @@
# partitioned hash join on clusters of H radix-bits.
03 res_join := phash_join(cluster_larger, cluster_smaller.reverse, H);
..
-\end{verbatim}
-Line 03 above uses phash\_join, but could alternatively use radix-join:
-\begin{verbatim}
+@end verbatim
+Line 03 above uses phash_join, but could alternatively use radix_join:
+@verbatim
..
03 res_join := radix_join(cluster_larger, cluster_smaller.reverse, H);
..
-\end{verbatim}
+@end verbatim
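To picture what the per-cluster phase does, here is an editorial sketch (all
names, the callback and the flat offset arrays are assumptions, not the
module's phash_join or radix_join code). It implements the radix-join variant,
i.e. nested-loop join between corresponding clusters; phash_join would instead
build a small hash table on the smaller side of each cluster and probe it, but
in both cases the random access stays inside one cache-sized cluster at a time.
@verbatim
#include <stddef.h>

/* Both key arrays have been radix-clustered on the same H bits.
 * l_off[c] and s_off[c] give the start of cluster c; l_off[1<<H] and
 * s_off[1<<H] hold the total counts.  emit_match() receives the positions
 * of each pair of matching keys. */
static void
radix_join(const unsigned *lkey, const size_t *l_off,
           const unsigned *skey, const size_t *s_off,
           int H, void (*emit_match)(size_t l_pos, size_t s_pos))
{
    size_t nclust = (size_t)1 << H, c, l, s;

    for (c = 0; c < nclust; c++)                 /* corresponding clusters */
        for (l = l_off[c]; l < l_off[c + 1]; l++)
            for (s = s_off[c]; s < s_off[c + 1]; s++)
                if (lkey[l] == skey[s])
                    emit_match(l, s);
}
@end verbatim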
From this point on, the same code as in the previous MIL script could be
applied to fetch the column
values for columns a1..aY from "smaller" and b1..bZ from "larger". The
remaining problem here is that both
-the larger\_bX *and* the smaller\_aX BATs will tend to be bigger than the
smallest memory cache (though this
+the larger_bX *and* the smaller_aX BATs will tend to be bigger than the
smallest memory cache (though this
also depends on the join hit-ratio, but let us suppose it is >= 1).
To improve this, we can sort on the OIDs from the "larger" table, as
described in the previous section.
-This will enhance the access pattern of the subsequent positional joins to the
larger\_bX column BATs:
+This will enhance the access pattern of the subsequent positional joins to the
larger_bX column BATs:
-\begin{verbatim}
+@verbatim
..
# partitioned hash join on clusters of H radix-bits, followed by
radix-sort on head column.
03 res_join := phash_join(cluster_larger, cluster_smaller.reverse,
H).reverse.radix_cluster(R1,..,Rp).reverse;
04 res_larger_sorted := res_join.mark(0@0).reverse;
05 res_smaller := res_join.reverse.mark(0@0).reverse;
..
-\end{verbatim}
+@end verbatim
-Now, as the smaller\_aX column BATs probably are also larger than the memory
cache, we would like to do
-the same for the "smaller" table. But, we then cannot sort res\_join twice. We
could sort res\_smaller on tail
+Now, as the smaller_aX column BATs probably are also larger than the memory
cache, we would like to do
+the same for the "smaller" table. But, we then cannot sort res_join twice. We
could sort res_smaller on tail
after line 05:
-\begin{verbatim}
+@verbatim
..
05 res_smaller := res_join.reverse.mark(0@0).reverse.radix_cluster(R1,..,Rp);
..
-\end{verbatim}
+@end verbatim
However, this approach only transfers the problem to later
phases of query processing.
-The positional joins would run fine, but as a tail-sorted res\_smaller would
be a BAT[oid,oid] (i.e.
-it would no longer have a VOID head column), the result of the positional
joins with the smaller\_aX
+The positional joins would run fine, but as a tail-sorted res_smaller would be
a BAT[oid,oid] (i.e.
+it would no longer have a VOID head column), the result of the positional
joins with the smaller_aX
BAT[void,T]s would be of the form BAT[oid,T]. These results would not only
take more space than the
desired form BAT[void,T], but would also create a problem in further use of
the query result, as these
-res\_aX BATs will not be sorted on head. Join access to them would go to the
hash-join rather than the
+res_aX BATs will not be sorted on head. Join access to them would go to the
hash-join rather than the
positional join, and due to the random access this would pose a memory caching
problem as these
-res\_aX BATs tend to be larger than the memory cache.
+res_aX BATs tend to be larger than the memory cache.
Therefore, for these projection joins, we propose the use of a new
memory-conscious join algorithm that is
-called {\em clustered positional join} which implements a join(BAT[void,oid]
L, BAT[void,T] R) : BAT[void,T]
+called @emph{clustered positional join} which implements a join(BAT[void,oid]
L, BAT[void,T] R) : BAT[void,T]
This algorithm consists of three phases, of which we already know the first
two:
-\begin{description}
-\item[partial radix-cluster]
-First, the res\_smaller is {\em partially} radix-clustered on tail-OID. That
is, the relation L (= res\_smaller in
-the positional joins to the column BATs smaller\_aX) is clustered on some
number of highest significant
+@table @code
+@item [partial radix-cluster]
+First, the res_smaller is @emph{partially} radix-clustered on tail-OID. That
is, the relation L (= res_smaller in
+the positional joins to the column BATs smaller_aX) is clustered on some
number of highest significant
radix-bits, but not on all radix-bits. Because the radix-cluster on OIDs does
not use a hash-function,
clustering an OID column on all significant bits radix-sorts it, or - as in
this case - on a subset of the
highest significant bits, partially orders it. The partial ordering of OIDs in
chunks is done in such a way that the
size of the corresponding chunk in R (remember R is a BAT[void,T] and has all
OIDs in a densely ascending sequence) fits
the memory cache.
-\item[positional join]
+@item [positional join]
The purpose of the radix-clustering of L in the previous phase is to accelerate
the positional join between L and R
-(i.e. res\_smaller and smaller\_aX). Because the OIDs in the tail of L are
now partially sorted, each chunk in L
+(i.e. res_smaller and smaller_aX). Because the OIDs in the tail of L are now
partially sorted, each chunk in L
will only randomly access data from one chunk in R. Therefore, during
positional join, these chunks in R stay
memory resident, accelerating performance with respect to doing the same
with a non-clustered L (where each access to
R would be a cache miss).
-\item[radix decluster]
+@item [radix decluster]
The result of the positional join is a BAT[oid,T]. Still, we know that the
head column, when sorted, would
form a void column. What is now the fastest way to sort it and convert it
back to a void BAT? One special property
of the radix-cluster algorithm is that when we cluster on tail column, each
result chunk will have the head-values
-in order. In this case, the clustered version of L (res\_smaller\_clustered,
see below) has the head OIDs in order {\em within the
+in order. In this case, the clustered version of L (res_smaller_clustered, see
below) has the head OIDs in order @emph{within the
chunk}. This sub-ordering is also carried over by the positional join to
result of the positional join: in each
'virtual chunk' in the BAT[oid,T], the OIDs appear in order. Therefore, we can
perform a merge operation to merge
all BAT[oid,T] chunks into a BAT[void,T] result. Normally, the cost of a merge
is at least O(log(P)*N), where N
is the total number of tuples, and P is the number of chunks. By using the
special property that eventually,
-the merged OIDs form a densely ascending sequence (0@0, 1@0,..,N@0), we can
bring this cost back to O(N)! This
{\em radix decluster}
+the merged OIDs form a densely ascending sequence (0@@0, 1@@0,..,N@@0), we can
bring this cost back to O(N)! This @emph{radix decluster}
algorithm keeps a window of OIDs [windowStart, windowStart+1, ...,
windowStart+windowSize-1] open during the
merge (a sketch follows below). Each iteration of the algorithm finds all the
next windowSize OIDs in
the chunks and inserts them
in the result BAT[void,T]. This is done by going through all (not yet
empty) chunks and inserting from the top of
@@ -487,10 +485,10 @@
in the chunk, the chunk cache lines will be re-used and performance will be
good. The only restriction on the windowSize
is that the insertion window on the output BAT must fit the memory cache. This
will only start to exceed on very
large table sizes (a possible remedy is then to perform the merge in multiple
passes).
-\end{description}
+@end table
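The merge described above can be sketched as follows (an editorial
illustration with invented types; the real radix_decluster works on BATs and
takes the cluster boundaries from a radix_count BAT): because the head OIDs
are sorted within each chunk and together form the dense sequence 0..N-1, a
window of windowSize consecutive OIDs can be swept over the output, and each
chunk cursor only ever moves forward, giving O(N) total work.
@verbatim
#include <stddef.h>

typedef unsigned int oid;

/* One chunk of the positional-join result: head OIDs (sorted within the
 * chunk) and the corresponding tail values. */
struct chunk {
    const oid *head;
    const int *tail;
    size_t n;           /* tuples in this chunk  */
    size_t pos;         /* next unconsumed tuple */
};

/* Merge P chunks into a dense (void-headed) result of ntotal tuples. */
static void
decluster_merge(struct chunk *chunks, size_t P,
                int *out, size_t ntotal, size_t window_size)
{
    size_t window_start, c;

    for (window_start = 0; window_start < ntotal; window_start += window_size) {
        size_t window_end = window_start + window_size;

        for (c = 0; c < P; c++) {
            struct chunk *ch = &chunks[c];

            /* pull every tuple whose OID falls inside the current window */
            while (ch->pos < ch->n && (size_t)ch->head[ch->pos] < window_end) {
                oid o = ch->head[ch->pos];
                out[o] = ch->tail[ch->pos];   /* OID == position: void head */
                ch->pos++;
            }
        }
    }
}
@end verbatim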
In all, in Monet this join strategy is expressed in the following MIL:
-\begin{verbatim}
+@verbatim
..
# subcluster on Rs significant radix-bits, ignoring lowest Ri bits
05 res_smaller_clustered := res_join.reverse.mark(0@0).reverse.radix_cluster(-Ri, Rs);
@@ -501,7 +499,7 @@
AX ....
AY res_aY := join(res_smaller_clustered,
smaller_aY).radix_decluster(borders_smaller);
..
-\end{verbatim}
+@end verbatim
Possibly, one could consider using this approach as well for the projections
on the larger table
(instead of the sort). However, sorting once (with radix-sort by
radix-clustering on all significant
@@ -514,7 +512,7 @@
The full MIL script, when the individual BATs of both the "smaller" and
"larger" tables as well as
the (N-M) join result exceed the memory cache, becomes:
-\begin{verbatim}
+@verbatim
# first radix cluster both key columns on H bits (maybe different number
of passes and bit settings)
01 cluster_larger := radix_cluster(larger_key, L1,..,Ll);
02 cluster_smaller := radix_cluster(smaller_key, S1,..,Ss);
@@ -536,10 +534,9 @@
B1 res_b1 := join(res_larger_sorted, larger_b1);
BX ....
BZ res_bZ := join(res_larger_sorted, larger_bZ);
-\end{verbatim}
+@end verbatim
@+ The Relational Approach
[EMAIL PROTECTED]
A cache-conscious join in a relational DBMS would first radix-cluster both the
smaller and larger table,
where in the process it would project on just the selected columns. As the
relational model does not
separate its algebraic actions by column, the way Monet does in MIL, it cannot use
the technique of type-expansion
@@ -561,7 +558,7 @@
all other columns.
The entire MIL sequence is then the following 5 statements:
-\begin{verbatim}
+@verbatim
01 smaller_all := [integer]([integer](smaller_key,1).reverse, 128).reverse;
02 larger_all := [integer]([integer](larger_key,1).reverse, 128).reverse;
@@ -571,10 +568,10 @@
05 cluster_smaller := radix_cluster(smaller_view, S1,..,Ss);
06 cluster_larger := radix_cluster(larger_view, L1,..,Ll);
07 res_join := phash_join(cluster_larger, cluster_smaller.reverse, H);
-\end{verbatim}
+@end verbatim
-Notice that the fact that smaller\_view and larger\_view are MIL views on the
base BATs
-smaller\_all and larger\_all, means that the projection is never materialized.
Projecting
+Notice that, because smaller_view and larger_view are MIL views on the
base BATs
+smaller_all and larger_all, the projection is never materialized.
Projecting
is done on the fly during the first pass of radix cluster, just like what
would happen in
a relational system. What is more, the copying of each integer8 (view) value
from its
storage as integer128 is done with 8 memcpy() calls that fetch values at
regular intervals
@@ -590,9 +587,9 @@
Therefore, it might even be that the increased cost of radix-cluster will make
simple
hash-join faster than partitioned hash-join:
-\begin{verbatim}
+@verbatim
05 res_join := join(larger_view, smaller_view.reverse);
-\end{verbatim}
+@end verbatim
Either way, we expect relational performance to be a factor 10 slower
than Monet
on big sizes of both the "smaller" and "larger" tables with hit rates such
that the result table is big.
@@ -1714,7 +1711,6 @@
@+ Radix Cluster
[EMAIL PROTECTED]
In radix cluster we want to deliver one new BAT that consists
of a consecutive memory area (like all BATs do) with the tuples
clustered on a certain radix. To do this correctly in one scan
@@ -1736,25 +1732,30 @@
TODO (easy): make N tunable
functions:
-\begin{itemize}
-\item radix\_buns() does the basic clustering stuff (95% of effort)
-\item cnt\_buns() is the histogram scan that is triggered when we need
- to move data
-\item move\_buns() does the moving work. This is tricky; as copy
- dependencies bind you to a certain schedule.
-\item radix\_chunk() is the main routine that does one radix scan and
- produces a new (chunk of a) bat. Does so by first going
- for uniform distributions and executing radix\_buns().
- If a buffer overflows, this stops and cnt\_buns() is
- done. The buffers are then moved into correct position
- by move\_buns(). Clustering is then finished by again
- radix\_buns().
-\item radix\_cluster() is the multilevel radix cluster routine. On the first
- level; it processes 1 chunk and produced N1 new ones.
- On the second level, N1 chunks are processed and divided
- each in N2 new ones (making for N1*N2 clusters), etc.
- For clustering each chunk, it calls radix\_chunk().
-\end{itemize}
+@itemize
+@item radix_buns()
+does the basic clustering stuff (95% of effort)
+@item cnt_buns()
+is the histogram scan that is triggered when we need
+to move data
+@item move_buns()
+does the moving work. This is tricky, as copy
+dependencies bind you to a certain schedule.
+@item radix_chunk()
+is the main routine that does one radix scan and
+produces a new (chunk of a) bat. Does so by first going
+for uniform distributions and executing radix_buns().
+If a buffer overflows, this stops and cnt_buns() is
+done. The buffers are then moved into correct position
+by move_buns(). Clustering is then finished by running
+radix_buns() again.
+@item radix_cluster()
+is the multilevel radix cluster routine. On the first
+level, it processes 1 chunk and produces N1 new ones.
+On the second level, N1 chunks are processed and divided
+each into N2 new ones (making for N1*N2 clusters), etc.
+For clustering each chunk, it calls radix_chunk().
+@end itemize
@c
#define any_RADIX(p,rd) ((hash_t)((*BATatoms[any].atomHash)(p) & rd))
#define oid_RADIX(p,rd) ((hash_t) *(oid*) (p) & rd)
@@ -2997,7 +2998,7 @@
its non-sorted OID head column into a sorted and densely ascending void
column. The tail
itself does not need to contain the partially sorted values; in fact we do not
even look
at those values but directly take the cluster boundaries as an input
parameter. This input
-parameter is a radix\_count BAT that must have been created with the
radix\_count command.
+parameter is a radix_count BAT that must have been created with the
radix_count command.
@= radix_decluster
{
@@ -3175,13 +3176,12 @@
}
@- radix_decluster
[EMAIL PROTECTED]
intends to improve performance by using a cheaper positional join:
join(bat[void,oid], bat[void,T])
instead of
join(bat[oid,oid], bat[void,T])
-price paid are two parameters to radix\_decluster.
+the price paid is two parameters to radix_decluster.
@= radix_decluster2
{
Index: batxml.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB5/src/modules/mal/batxml.mx,v
retrieving revision 1.23
retrieving revision 1.24
diff -u -d -r1.23 -r1.24
--- batxml.mx 11 Jan 2008 10:41:37 -0000 1.23
+++ batxml.mx 8 Feb 2008 22:36:34 -0000 1.24
@@ -797,6 +797,7 @@
BBPunfix(e->batCacheid);
BBPunfix(g->batCacheid);
BBPunfix(b->batCacheid);
+ bn->tsorted = 0;
BBPkeepref(*ret=bn->batCacheid);
return MAL_SUCCEED;
bunins_failed:
@@ -867,6 +868,7 @@
BBPunfix(r->batCacheid);
BBPunfix(j->batCacheid);
GDKfree(buf);
+ bn->tsorted = 0;
BBPkeepref(*ret=bn->batCacheid);
return MAL_SUCCEED;
bunins_failed: