Thanks Taran for the detailed explanation - makes perfect sense.
2009/3/3 Tarandeep Singh <[email protected]>: > Hi Tim, > > I am currently writing developer's guide for CloudBase. It will explain the > CloudBase design in detail as well as the algorithms used to do Joins, > Indexing etc. > > However, for indexing here is a brief introduction which you can also find > on the CloudBase website- > http://cloudbase.sourceforge.net/index.html#userDoc > > ------------------------------------------------------------------- > CloudBase supports Hash Indexing on any column of a table. Index can be > created on multiple columns as well (separate index for each column). During > indexing process, CloudBase creates an Inverted Index for column values. An > Inverted Index can be viewed as a persistent HashMap where keys are the > column values and values are the lists containing information about Data > Blocks containing the column values. > > CloudBase also stores the MIN and MAX value of a column (if its type is > numeric) present in a Data Block. These values are useful for range queries. > > During query execution process, if an indexed column is present in a WHERE > clause (for example WHERE c1 = 100 or c2 > 100) then CloudBase will locate > all those Data Blocks (from its index data) where these column values are > present and load only those Data Blocks instead of loading all Data Blocks. > > This indexing is very useful when query involves filtering on date or date > range and we have seen performance improvements of up to 97% in such cases. > ------------------------------------------------------------------- > > > For Joins, I have not explained the algorithm in the documentation, however > I am planning to do it so very soon. However, the process is explained in > the code comments. For your convenience I am pasting the comment from the > code here- > > ---------------------------------------------------- > /** > * Executes Map/Reduce job to handle join (inner or outer join) between > two > * tables/subqueries. > * <p> > * It uses semi-join approach to optimize the join process. Join algorithm > for > * inner join is as follows- first smaller table is reduced using the > columns > * participating in join condition as Keys (for MR job). Also, > bloom-filter > * for columns participating in join is constructed which is used to > filter > * out the rows of bigger table during the map process(later). Bigger > table is also > * reduced using the join columns as Keys. When rows of bigger table reach > * reducers, rows of smaller table are read back from HDFS and join is > * performed. > *<p> > * This algorithm makes use of one property of Map-Reduce paradigm - Same > keys > * go to one reducer. So this means, rows (of both tables) having same > values > * of columns participating in join will go to same reducer (as join > columns > * are used as Keys) and hence a join can be easily performed at reducer. > * Although, it seems reading rows of smaller table from HDFS and > performing > * join will be in-efficient, it is not. There are two reasons for the > same- > * first, Hadoop tries to store a copy of the data at the node which > produced > * that data. So this means, a reducer will be reading a local copy of the > > * rows of smaller table. Second reason is, a Map-Reduce job, sort the > keys > * so this means, rows of both smaller and bigger table are sorted on the > * join columns. This makes it easier to read rows and perform join. > * <p> > * In case of left outer join, left table is always reduced first. In case > * of right outer join, right table is reduced first. > * > * @see http://en.wikipedia.org/wiki/Relational_algebra#Semijoin > * @see http://en.wikipedia.org/wiki/Bloom_filter > */ > ---------------------------------------------------- > > Thanks, > Taran > > 2009/3/3 tim robertson <[email protected]> > >> Hi Taran, >> >> Have you a blog or something that explains how you process joins on >> cloudbase? >> >> E.g. how are indexes used, and how do you go through the joining using >> the data files and index files. >> Do you look at all possible indexes, determine the cardinality of each >> and from this pick a join order, or do you start at both ends of the >> query and meet in the middle (if you know what I mean...). This >> really influences schema design for large datasets in mysql (e.g. you >> need to store your own cardinality as mysql can't determine the best >> join order inherently) so I am wondering about porting my reporting >> application. >> >> I think this kind of info would be great for cloudbase docs. >> >> Cheers, >> >> Tim >> >> >> 2009/3/3 Tarandeep Singh <[email protected]>: >> > Tim is right. CloudBase is not equivalent to HBase. >> > >> > HBase is column oriented database based on Google's BigTable. CloudBase >> is a >> > database/data warehosue layer on top of Hadoop and by means of its SQL >> > interface makes it easier to mine logs. So instead of writing Map-Reduce >> > jobs for analyzing data, one can use SQL to do the same and SQL to Map >> > Reduce job translation is handled by CloudBase. >> > >> > -Taran >> > >> > 2009/3/3 tim robertson <[email protected]> >> > >> >> Hi Praveen, >> >> >> >> I think it is more equivalent to Hive than HBase - both offer joins >> >> and structured querying where HBase is more a column oriented data >> >> store with many to ones embedded in a single row and (currently) only >> >> indexes on the primary key, but secondary keys are coming. I >> >> anticipate using HBase as a back end to harvest into, but might make >> >> use of Hive or Cloudbase for ad hoc reporting when needed. >> >> >> >> Has anyone done any testing of Hive vs. Cloudbase for performance and >> >> comparison of features? >> >> >> >> Cheers, >> >> >> >> Tim >> >> >> >> >> >> 2009/3/3 Guttikonda, Praveen <[email protected]>: >> >> > Hi , >> >> > Will this be competing in a sense with HBASE then ? >> >> > >> >> > Cheers, >> >> > Praveen >> >> > >> >> > -----Original Message----- >> >> > From: Tarandeep Singh [mailto:[email protected]] >> >> > Sent: Tuesday, March 03, 2009 10:12 PM >> >> > To: [email protected] >> >> > Subject: Re: Announcing CloudBase-1.2.1 release >> >> > >> >> > Hi Lukas, >> >> > >> >> > Yes, you are right. As of now, CloudBase does not support unique keys >> and >> >> foreign keys on tables. CloudBase is designed as a database abstraction >> >> layer on top of Hadoop, thus making it easier to query/mine logs/huge >> data >> >> easily. >> >> > >> >> > -Taran >> >> > >> >> > >> >> > On Tue, Mar 3, 2009 at 1:15 AM, Lukáš Vlček <[email protected]> >> >> wrote: >> >> > >> >> >> Hi Taran, >> >> >> This looks impressive. I quickly looked at the documentation, am I >> >> >> right that it does not support unique keys and foreign keys for >> tables? >> >> >> >> >> >> Regards, >> >> >> Lukas >> >> >> >> >> >> On Mon, Mar 2, 2009 at 8:33 PM, Tarandeep Singh <[email protected] >> > >> >> >> wrote: >> >> >> >> >> >> > Hi, >> >> >> > >> >> >> > We have just released 1.2.1 version of CloudBase on sourceforge- >> >> >> > http://cloudbase.sourceforge.net >> >> >> > >> >> >> > [ CloudBase is a data warehouse system built on top of Hadoop's >> >> >> Map-Reduce >> >> >> > architecture. It uses ANSI SQL as its query language and comes with >> >> >> > a >> >> >> JDBC >> >> >> > driver. It is developed by Business.com and is released to open >> >> >> > source community under GNU GPL license] >> >> >> > >> >> >> > This release fixes one issue with the 1.2 release- Table Indexing >> >> >> > feature was not enabled in the 1.2 release. This release fixes this >> >> issue. >> >> >> > >> >> >> > Also we have updated the svn repository on the sourceforge site and >> >> >> > we invite contributors to work with us to improve CloudBase. The >> svn >> >> >> > repository url is- >> >> >> > https://cloudbase.svn.sourceforge.net/svnroot/cloudbase/trunk >> >> >> > >> >> >> > We will be uploading Developer's guide/documentation on the >> >> >> > CloudBase website very soon. Meanwhile, if someone wants to try >> >> >> > compiling the code and play around with it, please contact me, I >> can >> >> >> > help you get started. >> >> >> > >> >> >> > Thanks, >> >> >> > Taran >> >> >> > >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> http://blog.lukas-vlcek.com/ >> >> >> >> >> > >> >> >> > >> >
