Re: MonetDB: headless - This is the start of the branch to get rid o...

Lefteris Thu, 10 Feb 2011 02:14:04 -0800

Hi,

since these structures are changed, will you consider to add also
inter-col properties (such as information that this col is subordered
on that col, to allow multi-dimensional clustering)? Plus, what about
min and max fields next to the count one?


On Tue, Feb 8, 2011 at 3:36 PM, Sjoerd Mullender <[email protected]> wrote:
> Changeset: d4a362c539db for MonetDB
> URL: http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=d4a362c539db
> Modified Files:
>        gdk/Makefile.ag
>        gdk/gdk.h
>        gdk/gdk.mx
> Branch: headless
> Log Message:
>
> This is the start of the branch to get rid of BATs.
> BATs are *binary* association tables.  We want to move towards single
> column structures called COLs.  Basically, these are BATs whose HEAD
> column is always of type VOID-with-seqbase.
>
> Work on this is done on a branch called "headless".
>
> In addition, we're going to move away from the use of Mx.
>
> This changeset is a (small) step on both fronts.
>
> BATstore, BAT, COLrec, BATrec, and BUNrec are all gone and have been
> replaced by a single structure called COL.  I have also gone through
> the macro definitions and function declarations and removed references
> to head/tail (although I may have missed a few).
>
> There is still a lot of work to be done.  I can guarantee, the current
> version cannot get through even the most lenient compiler.
>
>
> diffs (truncated from 5310 to 300 lines):
>
> diff --git a/gdk/Makefile.ag b/gdk/Makefile.ag
> --- a/gdk/Makefile.ag
> +++ b/gdk/Makefile.ag
> @@ -38,7 +38,7 @@
>                gdk_scanselect_defs_str.mx \
>                gdk_scanselect_defs_fix.mx \
>                gdk_scanselect_defs_var.mx \
> -               gdk_scanselect.mx gdk.mx gdk_batop.mx \
> +               gdk_scanselect.mx gdk_batop.mx \
>                gdk_search.mx gdk_tm.mx gdk_align.mx gdk_bbp.mx \
>                gdk_heap.mx gdk_setop.mx gdk_utils.mx gdk_atoms.mx \
>                gdk_qsort.mx gdk_ssort.mx gdk_storage.mx gdk_bat.mx \
> @@ -52,3 +52,5 @@
>                $(SOCKET_LIBS) $(zlib_LIBS) $(BZ_LIBS) \
>                $(MALLOC_LIBS) $(PTHREAD_LIBS) $(DL_LIBS)
>  }
> +
> +EXTRA_DIST = gdk.h
> diff --git a/gdk/gdk.mx b/gdk/gdk.h
> rename from gdk/gdk.mx
> rename to gdk/gdk.h
> --- a/gdk/gdk.mx
> +++ b/gdk/gdk.h
> @@ -1,392 +1,393 @@
> -@/
> -The contents of this file are subject to the MonetDB Public License
> -Version 1.1 (the "License"); you may not use this file except in
> -compliance with the License. You may obtain a copy of the License at
> -http://monetdb.cwi.nl/Legal/MonetDBLicense-1.1.html
> +/*
> + * The contents of this file are subject to the MonetDB Public License
> + * Version 1.1 (the "License"); you may not use this file except in
> + * compliance with the License. You may obtain a copy of the License at
> + * http://monetdb.cwi.nl/Legal/MonetDBLicense-1.1.html
> + *
> + * Software distributed under the License is distributed on an "AS IS"
> + * basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
> + * License for the specific language governing rights and limitations
> + * under the License.
> + *
> + * The Original Code is the MonetDB Database System.
> + *
> + * The Initial Developer of the Original Code is CWI.
> + * Portions created by CWI are Copyright (C) 1997-July 2008 CWI.
> + * Copyright August 2008-2011 MonetDB B.V.
> + * All Rights Reserved.
> + */
>
> -Software distributed under the License is distributed on an "AS IS"
> -basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
> -License for the specific language governing rights and limitations
> -under the License.
> +/*
> + * @f gdk
> + * @t The Goblin Database Kernel
> + * @v Version 3.05
> + * @a Martin L. Kersten, Peter Boncz, Niels Nes
> + *
> + * @+ The Inner Core
> + * The innermost library of the MonetDB database system is formed by
> + * the library called GDK, an abbreviation of Goblin Database Kernel.
> + * Its development was originally rooted in the design of a pure
> + * active-object-oriented programming language, before development
> + * was shifted towards a re-usable database kernel engine.
> + *
> + * GDK is a C library that provides ACID properties on a DSM model
> + * @tex
> + * [@cite{Copeland85}]
> + * @end tex
> + * , using main-memory
> + * database algorithms
> + * @tex
> + * [@cite{Garcia-Molina92}]
> + * @end tex
> + *  built on virtual-memory
> + * OS primitives and multi-threaded parallelism.
> + * Its implementation has undergone various changes over its decade
> + * of development, many of which were driven by external needs to
> + * obtain a robust and fast database system.
> + *
> + * The coding scheme explored in GDK has also laid a foundation to
> + * communicate over time experiences and to provide (hopefully)
> + * helpful advice near to the place where the code-reader needs it.
> + * Of course, over such a long time the documentation diverges from
> + * reality. Especially in areas where the environment of this package
> + * is being described.
> + * Consider such deviations as historic landmarks, e.g. crystallization
> + * of brave ideas and mistakes rectified at a later stage.
> + *
> + * @+ Short Outline
> + * The facilities provided in this implementation are:
> + * @itemize
> + * @item
> + * GDK or Goblin Database Kernel routines for session management
> + * @item
> + *  BAT routines that define the primitive operations on the
> + * database tables (BATs).
> + * @item
> + *  BBP routines to manage the BAT Buffer Pool (BBP).
> + * @item
> + *  ATOM routines to manipulate primitive types, define new types
> + * using an ADT interface.
> + * @item
> + *  HEAP routines for manipulating heaps: linear spaces of memory
> + * that are GDK's vehicle of mass storage (on which BATs are built).
> + * @item
> + *  DELTA routines to access inserted/deleted elements within a
> + * transaction.
> + * @item
> + *  HASH routines for manipulating GDK's built-in linear-chained
> + * hash tables, for accelerating lookup searches on BATs.
> + * @item
> + *  TM routines that provide basic transaction management primitives.
> + * @item
> + *  TRG routines that provided active database support. [DEPRECATED]
> + * @item
> + *  ALIGN routines that implement BAT alignment management.
> + * @end itemize
> + *
> + * The Binary Association Table (BAT) is the lowest level of storage
> + * considered in the Goblin runtime system
> + * @tex
> + * [@cite{Goblin}]
> + * @end tex
> + * .  A BAT is a
> + * self-descriptive main-memory structure that represents the @strong{binary
> + * relationship} between two atomic types.
> + * The association can be defined over:
> + * @table @code
> + * @item void:
> + *  virtual-OIDs: a densely ascending column of OIDs (takes zero-storage).
> + * @item bit:
> + *  Booleans, implemented as one byte values.
> + * @item chr:
> + * A single character (8 bits @strong{integer}s).
> + * DEPRECATED for storing text (Unicode not supported).
> + * @item bte:
> + *  Tiny (1-byte) integers (8-bit @strong{integer}s).
> + * @item sht:
> + *  Short integers (16-bit @strong{integer}s).
> + * @item int:
> + *  This is the C @strong{int} type (32-bit).
> + * @item oid:
> + *  Unique @strong{long int} values uses as object identifier. Highest bit 
> cleared always.
> + *         Thus, oids-s are 31-bit numbers on 32-bit systems, and 63-bit 
> numbers on 64-bit systems.
> + * @item wrd:
> + *  Machine-word sized integers
> + *  (32-bit on 32-bit systems, 64-bit on 64-bit systems).
> + * @item ptr:
> + * Memory pointer values. DEPRECATED.  Can only be stored in transient BATs.
> + * @item flt:
> + *  The IEEE @strong{float} type.
> + * @item dbl:
> + *  The IEEE @strong{double} type.
> + * @item lng:
> + *  Longs: the C @strong{long long} type (64-bit integers).
> + * @item str:
> + *  UTF-8 strings (Unicode). A zero-terminated byte sequence.
> + * @item bat:
> + *  Bat descriptor. This allows for recursive adminstered tables, but
> + * severely complicates transaction management. Therefore, they
> + * CAN ONLY BE STORED IN TRANSIENT BATs.
> + * @end table
> + *
> + * This model can be used as a back-end model underlying other -higher
> + * level- models, in order to achieve @strong{better performance} and
> + * @strong{data independence} in one go. The relational model and
> + * the object-oriented model can be mapped on BATs by vertically
> + * splitting every table (or class) for each attribute. Each such a
> + * column is then stored in a BAT with type @strong{bat[oid,attribute]}, 
> where
> + * the unique object identifiers link tuples in the different BATs.
> + * Relationship attributes in the object-oriented model hence are
> + * mapped to @strong{bat[oid,oid]} tables, being equivalent to the concept of
> + * @emph{join indexes}
> + * @tex
> + * [@cite{Valduriez87}]
> + * @end tex
> + * .
> + *
> + * The set of built-in types can be extended with user-defined types
> + * through an ADT interface.  They are linked with the kernel to obtain
> + * an enhanced library, or they are dynamically loaded upon request.
> + *
> + * Types can be derived from other types. They represent something different
> + * than that from which they are derived, but their internal storage 
> management
> + * is equal. This feature facilitates the work of extension programmers, by
> + * enabling reuse of implementation code, but is also used to keep the GDK 
> code
> + * portable from 32-bits to 64-bits machines: the @strong{oid} and 
> @strong{ptr} types
> + * are derived from @strong{int} on 32-bits machines, but is derived from 
> @strong{lng}
> + * on 64 bits machines. This requires changes in only two lines of code each.
> + *
> + * To accelerate lookup and search in BATs, GDK supports one built-in
> + * search accelerator: hash tables. We choose an implementation efficient
> + * for main-memory: bucket chained hash
> + * @tex
> + * [@cite{LehCar86,Analyti92}]
> + * @end tex
> + * . Alternatively, when the table is sorted, it will resort to merge-scan
> + * operations or binary lookups.
> + *
> + * BATs are built on the concept of heaps, which are large pieces of main
> + * memory. They can also consist of virtual memory, in case the working
> + * set exceeds main-memory. In this case, GDK supports operations that
> + * cluster the heaps of a BAT, in order to improve performance of its
> + * main-memory.
> + *
> + *
> + * @- Rationale
> + * The rationale for choosing a BAT as the building block for both
> + * relational and object-oriented system is based on the following
> + * observations:
> + *
> + * @itemize
> + * @item -
> + * Given the fact that CPU speed and main-memory increase in
> + * current workstation hardware for the last years has been exceeding
> + * IO access speed increase, traditional disk-page oriented algorithms
> + * do no longer take best advantage of hardware, in most database operations.
> + *
> + * Instead of having a disk-block oriented kernel with a large memory
> + * cache, we choose to build a main-memory kernel, that only under large data
> + * volumes slowly degrades to IO-bound performance, comparable to
> + * traditional systems
> + * @tex
> + * [@cite{boncz95,boncz96}]
> + * @end tex
> + * .
> + *
> + * @item -
> + * Traditional (disk-based) relational systems move too much data
> + * around to save on (main-memory) join operations.
> + *
> + * The fully decomposed store (DSM
> + * @tex
> + * [@cite{Copeland85})]
> + * @end tex
> + * assures that only those attributes of a relation that are needed,
> + * will have to be accessed.
> + *
> + * @item -
> + * The data management issues for a binary association is much
> + * easier to deal with than traditional @emph{struct}-based approaches
> + * encountered in relational systems.
> + *
> + * @item -
> + * Object-oriented systems often maintain a double cache, one with the
> + * disk-based representation and a C pointer-based main-memory structure.
> + * This causes expensive conversions and replicated storage management.
> + * GDK does not do such `pointer swizzling'. It used virtual-memory
> + * (@strong{mmap()}) and buffer management advice (@strong{madvise()}) OS 
> primitives to
> + * cache only once. Tables take the same form in memory as on disk,
> + * making the use of this technique transparent
> + * @tex
> + * [@cite{oo7}]
> + * @end tex
> + * .
> + * @end itemize
> + *
> + * A RDBMS or OODBMS based on BATs strongly depends on our ability
> + * to efficiently support tuples and to handle small joins, respectively.
> + *
> + * The remainder of this document describes the Goblin Database kernel
> + * implementation at greater detail. It is organized as follows:
> + * @table @code
> + * @item @strong{GDK Interface}:
> + *
> + * It describes the global interface with which GDK sessions can be
> + * started and ended, and environment variables used.
> + *
> + * @item @strong{Binary Association Tables}:
> + *
> + * As already mentioned, these are the primary data structure of GDK.
> + * This chapter describes the kernel operations for creation, destruction
> + * and basic manipulation of BATs and BUNs (i.e. tuples: Binary UNits).
> + *
> + * @item @strong{BAT Buffer Pool:}
> + *
> + * All BATs are registered in the BAT Buffer Pool. This directory is used
> + * to guide swapping in and out of BATs. Here we find routines that guide
> + * this swapping process.
> + *
> + * @item @strong{GDK Extensibility:}
> + *
> + * Atoms can be defined using a unified ADT interface.
> + * There is also an interface to extend the GDK library with
> + * dynamically linked object code.
> + *
> + * @item @strong{GDK Utilities:}
> + *
> + * Memory allocation and error handling primitives are provided. Layers
> + * built on top of GDK should use them, for proper system monitoring.
> + * Thread management is also included here.
> + *
> + * @item @strong{Transaction Management:}
> + *
> + * For the time being, we just provide BAT-grained concurrency and global
> + * transactions. Work is needed here.
> + *
> + * @item @strong{BAT Alignment:}
> + * Due to the mapping of multi-ary datamodels onto the BAT model,
> _______________________________________________
> Checkin-list mailing list
> [email protected]
> http://mail.monetdb.org/mailman/listinfo/checkin-list
>
_______________________________________________
Checkin-list mailing list
[email protected]
http://mail.monetdb.org/mailman/listinfo/checkin-list

Re: MonetDB: headless - This is the start of the branch to get rid o...

Reply via email to