Hi Tim, I am currently writing developer's guide for CloudBase. It will explain the CloudBase design in detail as well as the algorithms used to do Joins, Indexing etc.
However, for indexing here is a brief introduction which you can also find on the CloudBase website- http://cloudbase.sourceforge.net/index.html#userDoc ------------------------------------------------------------------- CloudBase supports Hash Indexing on any column of a table. Index can be created on multiple columns as well (separate index for each column). During indexing process, CloudBase creates an Inverted Index for column values. An Inverted Index can be viewed as a persistent HashMap where keys are the column values and values are the lists containing information about Data Blocks containing the column values. CloudBase also stores the MIN and MAX value of a column (if its type is numeric) present in a Data Block. These values are useful for range queries. During query execution process, if an indexed column is present in a WHERE clause (for example WHERE c1 = 100 or c2 > 100) then CloudBase will locate all those Data Blocks (from its index data) where these column values are present and load only those Data Blocks instead of loading all Data Blocks. This indexing is very useful when query involves filtering on date or date range and we have seen performance improvements of up to 97% in such cases. ------------------------------------------------------------------- For Joins, I have not explained the algorithm in the documentation, however I am planning to do it so very soon. However, the process is explained in the code comments. For your convenience I am pasting the comment from the code here- ---------------------------------------------------- /** * Executes Map/Reduce job to handle join (inner or outer join) between two * tables/subqueries. * <p> * It uses semi-join approach to optimize the join process. Join algorithm for * inner join is as follows- first smaller table is reduced using the columns * participating in join condition as Keys (for MR job). Also, bloom-filter * for columns participating in join is constructed which is used to filter * out the rows of bigger table during the map process(later). Bigger table is also * reduced using the join columns as Keys. When rows of bigger table reach * reducers, rows of smaller table are read back from HDFS and join is * performed. *<p> * This algorithm makes use of one property of Map-Reduce paradigm - Same keys * go to one reducer. So this means, rows (of both tables) having same values * of columns participating in join will go to same reducer (as join columns * are used as Keys) and hence a join can be easily performed at reducer. * Although, it seems reading rows of smaller table from HDFS and performing * join will be in-efficient, it is not. There are two reasons for the same- * first, Hadoop tries to store a copy of the data at the node which produced * that data. So this means, a reducer will be reading a local copy of the * rows of smaller table. Second reason is, a Map-Reduce job, sort the keys * so this means, rows of both smaller and bigger table are sorted on the * join columns. This makes it easier to read rows and perform join. * <p> * In case of left outer join, left table is always reduced first. In case * of right outer join, right table is reduced first. * * @see http://en.wikipedia.org/wiki/Relational_algebra#Semijoin * @see http://en.wikipedia.org/wiki/Bloom_filter */ ---------------------------------------------------- Thanks, Taran 2009/3/3 tim robertson <[email protected]> > Hi Taran, > > Have you a blog or something that explains how you process joins on > cloudbase? > > E.g. how are indexes used, and how do you go through the joining using > the data files and index files. > Do you look at all possible indexes, determine the cardinality of each > and from this pick a join order, or do you start at both ends of the > query and meet in the middle (if you know what I mean...). This > really influences schema design for large datasets in mysql (e.g. you > need to store your own cardinality as mysql can't determine the best > join order inherently) so I am wondering about porting my reporting > application. > > I think this kind of info would be great for cloudbase docs. > > Cheers, > > Tim > > > 2009/3/3 Tarandeep Singh <[email protected]>: > > Tim is right. CloudBase is not equivalent to HBase. > > > > HBase is column oriented database based on Google's BigTable. CloudBase > is a > > database/data warehosue layer on top of Hadoop and by means of its SQL > > interface makes it easier to mine logs. So instead of writing Map-Reduce > > jobs for analyzing data, one can use SQL to do the same and SQL to Map > > Reduce job translation is handled by CloudBase. > > > > -Taran > > > > 2009/3/3 tim robertson <[email protected]> > > > >> Hi Praveen, > >> > >> I think it is more equivalent to Hive than HBase - both offer joins > >> and structured querying where HBase is more a column oriented data > >> store with many to ones embedded in a single row and (currently) only > >> indexes on the primary key, but secondary keys are coming. I > >> anticipate using HBase as a back end to harvest into, but might make > >> use of Hive or Cloudbase for ad hoc reporting when needed. > >> > >> Has anyone done any testing of Hive vs. Cloudbase for performance and > >> comparison of features? > >> > >> Cheers, > >> > >> Tim > >> > >> > >> 2009/3/3 Guttikonda, Praveen <[email protected]>: > >> > Hi , > >> > Will this be competing in a sense with HBASE then ? > >> > > >> > Cheers, > >> > Praveen > >> > > >> > -----Original Message----- > >> > From: Tarandeep Singh [mailto:[email protected]] > >> > Sent: Tuesday, March 03, 2009 10:12 PM > >> > To: [email protected] > >> > Subject: Re: Announcing CloudBase-1.2.1 release > >> > > >> > Hi Lukas, > >> > > >> > Yes, you are right. As of now, CloudBase does not support unique keys > and > >> foreign keys on tables. CloudBase is designed as a database abstraction > >> layer on top of Hadoop, thus making it easier to query/mine logs/huge > data > >> easily. > >> > > >> > -Taran > >> > > >> > > >> > On Tue, Mar 3, 2009 at 1:15 AM, Lukáš Vlček <[email protected]> > >> wrote: > >> > > >> >> Hi Taran, > >> >> This looks impressive. I quickly looked at the documentation, am I > >> >> right that it does not support unique keys and foreign keys for > tables? > >> >> > >> >> Regards, > >> >> Lukas > >> >> > >> >> On Mon, Mar 2, 2009 at 8:33 PM, Tarandeep Singh <[email protected] > > > >> >> wrote: > >> >> > >> >> > Hi, > >> >> > > >> >> > We have just released 1.2.1 version of CloudBase on sourceforge- > >> >> > http://cloudbase.sourceforge.net > >> >> > > >> >> > [ CloudBase is a data warehouse system built on top of Hadoop's > >> >> Map-Reduce > >> >> > architecture. It uses ANSI SQL as its query language and comes with > >> >> > a > >> >> JDBC > >> >> > driver. It is developed by Business.com and is released to open > >> >> > source community under GNU GPL license] > >> >> > > >> >> > This release fixes one issue with the 1.2 release- Table Indexing > >> >> > feature was not enabled in the 1.2 release. This release fixes this > >> issue. > >> >> > > >> >> > Also we have updated the svn repository on the sourceforge site and > >> >> > we invite contributors to work with us to improve CloudBase. The > svn > >> >> > repository url is- > >> >> > https://cloudbase.svn.sourceforge.net/svnroot/cloudbase/trunk > >> >> > > >> >> > We will be uploading Developer's guide/documentation on the > >> >> > CloudBase website very soon. Meanwhile, if someone wants to try > >> >> > compiling the code and play around with it, please contact me, I > can > >> >> > help you get started. > >> >> > > >> >> > Thanks, > >> >> > Taran > >> >> > > >> >> > >> >> > >> >> > >> >> -- > >> >> http://blog.lukas-vlcek.com/ > >> >> > >> > > >> > > >
