Thanks Taran for the detailed explanation - makes perfect sense.



2009/3/3 Tarandeep Singh <[email protected]>:
> Hi Tim,
>
> I am currently writing developer's guide for CloudBase. It will explain the
> CloudBase design in detail as well as the algorithms used to do Joins,
> Indexing etc.
>
> However, for indexing here is a brief introduction which you can also find
> on the CloudBase website-
> http://cloudbase.sourceforge.net/index.html#userDoc
>
> -------------------------------------------------------------------
> CloudBase supports Hash Indexing on any column of a table. Index can be
> created on multiple columns as well (separate index for each column). During
> indexing process, CloudBase creates an Inverted Index for column values. An
> Inverted Index can be viewed as a persistent HashMap where keys are the
> column values and values are the lists containing information about Data
> Blocks containing the column values.
>
> CloudBase also stores the MIN and MAX value of a column (if its type is
> numeric) present in a Data Block. These values are useful for range queries.
>
> During query execution process, if an indexed column is present in a WHERE
> clause (for example WHERE c1 = 100 or c2 > 100) then CloudBase will locate
> all those Data Blocks (from its index data) where these column values are
> present and load only those Data Blocks instead of loading all Data Blocks.
>
> This indexing is very useful when query involves filtering on date or date
> range and we have seen performance improvements of up to 97% in such cases.
> -------------------------------------------------------------------
>
>
> For Joins, I have not explained the algorithm in the documentation, however
> I am planning to do it so very soon. However, the process is explained in
> the code comments. For your convenience I am pasting the comment from the
> code here-
>
> ----------------------------------------------------
>  /**
>   * Executes Map/Reduce job to handle join (inner or outer join) between
> two
>   * tables/subqueries.
>   * <p>
>   * It uses semi-join approach to optimize the join process. Join algorithm
> for
>   * inner join is as follows- first smaller table is reduced using the
> columns
>   * participating in join condition as Keys (for MR job). Also,
> bloom-filter
>   * for columns participating in join is constructed which is used to
> filter
>   * out the rows of bigger table during the map process(later). Bigger
> table is also
>   * reduced using the join columns as Keys. When rows of bigger table reach
>   * reducers, rows of smaller table are read back from HDFS and join is
>   * performed.
>   *<p>
>   * This algorithm makes use of one property of Map-Reduce paradigm - Same
> keys
>   * go to one reducer. So this means, rows (of both tables) having same
> values
>   * of columns participating in join will go to same reducer (as join
> columns
>   * are used as Keys) and hence a join can be easily performed at reducer.
>   * Although, it seems reading rows of smaller table from HDFS and
> performing
>   * join will be in-efficient, it is not. There are two reasons for the
> same-
>   * first, Hadoop tries to store a copy of the data at the node which
> produced
>   * that data. So this means, a reducer will be reading a local copy of the
>
>   * rows of smaller table. Second reason is, a Map-Reduce job, sort the
> keys
>   * so this means, rows of both smaller and bigger table are sorted on the
>   * join columns. This makes it easier to read rows and perform join.
>   * <p>
>   * In case of left outer join, left table is always reduced first. In case
>   * of right outer join, right table is reduced first.
>   *
>   * @see http://en.wikipedia.org/wiki/Relational_algebra#Semijoin
>   * @see http://en.wikipedia.org/wiki/Bloom_filter
>   */
> ----------------------------------------------------
>
> Thanks,
> Taran
>
> 2009/3/3 tim robertson <[email protected]>
>
>> Hi Taran,
>>
>> Have you a blog or something that explains how you process joins on
>> cloudbase?
>>
>> E.g. how are indexes used, and how do you go through the joining using
>> the data files and index files.
>> Do you look at all possible indexes, determine the cardinality of each
>> and from this pick a join order, or do you start at both ends of the
>> query and meet in the middle (if you know what I mean...).  This
>> really influences schema design for large datasets in mysql (e.g. you
>> need to store your own cardinality as mysql can't determine the best
>> join order inherently) so I am wondering about porting my reporting
>> application.
>>
>> I think this kind of info would be great for cloudbase docs.
>>
>> Cheers,
>>
>> Tim
>>
>>
>> 2009/3/3 Tarandeep Singh <[email protected]>:
>> > Tim is right. CloudBase is not equivalent to HBase.
>> >
>> > HBase is column oriented database based on Google's BigTable. CloudBase
>> is a
>> > database/data warehosue layer on top of Hadoop and by means of its SQL
>> > interface makes it easier to mine logs. So instead of writing Map-Reduce
>> > jobs for analyzing data, one can use SQL to do the same and SQL to Map
>> > Reduce job translation is handled by CloudBase.
>> >
>> > -Taran
>> >
>> > 2009/3/3 tim robertson <[email protected]>
>> >
>> >> Hi Praveen,
>> >>
>> >> I think it is more equivalent to Hive than HBase - both offer joins
>> >> and structured querying where HBase is more a column oriented data
>> >> store with many to ones embedded in a single row and (currently) only
>> >> indexes on the primary key, but secondary keys are coming.  I
>> >> anticipate using HBase as a back end to harvest into, but might make
>> >> use of Hive or Cloudbase for ad hoc reporting when needed.
>> >>
>> >> Has anyone done any testing of Hive vs. Cloudbase for performance and
>> >> comparison of features?
>> >>
>> >> Cheers,
>> >>
>> >> Tim
>> >>
>> >>
>> >> 2009/3/3 Guttikonda, Praveen <[email protected]>:
>> >> > Hi ,
>> >> > Will this be competing in a sense with HBASE then ?
>> >> >
>> >> > Cheers,
>> >> > Praveen
>> >> >
>> >> > -----Original Message-----
>> >> > From: Tarandeep Singh [mailto:[email protected]]
>> >> > Sent: Tuesday, March 03, 2009 10:12 PM
>> >> > To: [email protected]
>> >> > Subject: Re: Announcing CloudBase-1.2.1 release
>> >> >
>> >> > Hi Lukas,
>> >> >
>> >> > Yes, you are right. As of now, CloudBase does not support unique keys
>> and
>> >> foreign keys on tables. CloudBase is designed as a database abstraction
>> >> layer on top of Hadoop, thus making it easier to query/mine logs/huge
>> data
>> >> easily.
>> >> >
>> >> > -Taran
>> >> >
>> >> >
>> >> > On Tue, Mar 3, 2009 at 1:15 AM, Lukáš Vlček <[email protected]>
>> >> wrote:
>> >> >
>> >> >> Hi Taran,
>> >> >> This looks impressive. I quickly looked at the documentation, am I
>> >> >> right that it does not support unique keys and foreign keys for
>> tables?
>> >> >>
>> >> >> Regards,
>> >> >> Lukas
>> >> >>
>> >> >> On Mon, Mar 2, 2009 at 8:33 PM, Tarandeep Singh <[email protected]
>> >
>> >> >> wrote:
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > We have just released 1.2.1 version of CloudBase on sourceforge-
>> >> >> > http://cloudbase.sourceforge.net
>> >> >> >
>> >> >> > [ CloudBase is a data warehouse system built on top of Hadoop's
>> >> >> Map-Reduce
>> >> >> > architecture. It uses ANSI SQL as its query language and comes with
>> >> >> > a
>> >> >> JDBC
>> >> >> > driver. It is developed by Business.com and is released to open
>> >> >> > source community under GNU GPL license]
>> >> >> >
>> >> >> > This release fixes one issue with the 1.2 release- Table Indexing
>> >> >> > feature was not enabled in the 1.2 release. This release fixes this
>> >> issue.
>> >> >> >
>> >> >> > Also we have updated the svn repository on the sourceforge site and
>> >> >> > we invite contributors to work with us to improve CloudBase. The
>> svn
>> >> >> > repository url is-
>> >> >> > https://cloudbase.svn.sourceforge.net/svnroot/cloudbase/trunk
>> >> >> >
>> >> >> > We will be uploading Developer's guide/documentation on the
>> >> >> > CloudBase website very soon. Meanwhile, if someone wants to try
>> >> >> > compiling the code and play around with it, please contact me, I
>> can
>> >> >> > help you get started.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Taran
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> http://blog.lukas-vlcek.com/
>> >> >>
>> >> >
>> >>
>> >
>>
>

Reply via email to