On Mon, 1 Feb 2010 09:42:16 -0600 Jonathan Ellis <jbel...@gmail.com> wrote: 

JE> 2010/2/1 Ted Zlatanov <t...@lifelogs.com>:
>> On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov <t...@lifelogs.com> wrote:
>> 
TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis <jbel...@gmail.com> wrote:
JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
JE> <meharchaita...@gmail.com> wrote:
>>>>>   1. This would lead to an enormous amount of duplication of data; in short,
>>>>>   if I now want to view the data from the IS_PUBLISHED dimension, then my
>>>>>   database size would scale up tremendously.
>> 
JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
JE> other things fast.
>> 
TZ> IIUC, Mehar would be duplicating the article data for every article tag.
>> 
TZ> I searched the bug tracker and wiki and didn't find anything on the
TZ> topic of tag storage and search, so I don't think Cassandra supports
TZ> tags without data duplication.
>> 
TZ> Would it be possible to implement an optional byte[] bitmap field in
TZ> SliceRange?  If you can specify the bitmap as an optional field it would
TZ> not break current clients.  Then the search can return only the subset
TZ> of the range that matches the bitmap.  This would make sense for
TZ> BytesType and LongType, at least.
>> 
>> I looked at the source code and it seems that
>> StorageProxy::getSliceRange() is the focal point for reads and bitmap
>> matching should be implemented there.  The bitmap could be applied as a
>> filter before the other SliceRange parameters, especially the max number
>> of return results.  It may be worth the effort to send the bitmap down
>> to the ReadCommand/ColumnFamily level to reduce the number of potential
>> matches.
>> 
>> If this is not feasible for technical reasons I'd like to know.
>> Otherwise I'll put it on my TODO list and produce a proposal (unless
>> someone more knowledgeable is interested, of course).

JE> how would this be different than the byte[] column name you can
JE> already match on?

Given byte columns

A 0110
B 0111
C 0101

the bitmask approach would let you specify a bitmask of "0001" and get
only B and C.  It's just an AND that looks for a non-zero value.  So you can
say "0111" and get A, B, and C.  Or "0010" to get A and B.  "1000" gets
nothing.
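
As a rough sketch of that matching rule (Python, purely illustrative:
the names `columns`, `matches`, and `filter_columns` are made up here and
are not part of any Cassandra API), treating each column name as a 4-bit
integer:

```python
# Hypothetical data: labels A, B, C mapped to the bits of each column name.
columns = {"A": 0b0110, "B": 0b0111, "C": 0b0101}

def matches(name_bits: int, mask: int) -> bool:
    """A column matches when it shares at least one set bit with the mask."""
    return (name_bits & mask) != 0

def filter_columns(cols: dict, mask: int) -> list:
    """Return the labels whose bits overlap the mask, in sorted order."""
    return sorted(label for label, bits in cols.items() if matches(bits, mask))

print(filter_columns(columns, 0b0111))  # ['A', 'B', 'C']
print(filter_columns(columns, 0b0010))  # ['A', 'B']
print(filter_columns(columns, 0b1000))  # []
```

In a real implementation the filter would presumably run before the
SliceRange count limit is applied, but that is an assumption about where
the check would live, not something the sketch demonstrates.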

Cassandra could support OR-ed multiples for better queries, so you could
ask for (0001,0010) to get A, B, and C.
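
A minimal sketch of the OR-ed variant, this time over byte strings since
the proposed field is a byte[] (again illustrative Python; `overlaps` and
`matches_any` are hypothetical helpers, and comparing only up to the
shorter length is an assumption, not settled behavior):

```python
def overlaps(name: bytes, mask: bytes) -> bool:
    """True if any byte position has a set bit in common (pairwise AND).

    zip() stops at the shorter input, so trailing bytes are ignored.
    """
    return any(n & m for n, m in zip(name, mask))

def matches_any(name: bytes, masks: list) -> bool:
    """OR semantics: the name matches if it overlaps at least one mask."""
    return any(overlaps(name, mask) for mask in masks)

# Hypothetical data: the same three one-byte column names as above.
columns = {"A": bytes([0b0110]), "B": bytes([0b0111]), "C": bytes([0b0101])}
masks = [bytes([0b0001]), bytes([0b0010])]

hits = sorted(label for label, name in columns.items() if matches_any(name, masks))
print(hits)  # ['A', 'B', 'C']
```

Here "0001" alone picks up B and C, "0010" alone picks up A and B, and
the union is all three.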

Ted
