On Mon, 1 Feb 2010 09:42:16 -0600 Jonathan Ellis <jbel...@gmail.com> wrote:
JE> 2010/2/1 Ted Zlatanov <t...@lifelogs.com>:
>> On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov <t...@lifelogs.com> wrote:
>>
TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis <jbel...@gmail.com> wrote:
JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
JE> <meharchaita...@gmail.com> wrote:
>>>>> 1. This would lead to enormous amount of duplication of data, in short
>>>>> if I now want to view the data from IS_PUBLISHED dimension then my
>>>>> database size would scale up tremendously.
>>
JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
JE> other things fast.
>>
TZ> IIUC, Mehar would be duplicating the article data for every article tag.
>>
TZ> I searched the bug tracker and wiki and didn't find anything on the
TZ> topic of tag storage and search, so I don't think Cassandra supports
TZ> tags without data duplication.
>>
TZ> Would it be possible to implement an optional byte[] bitmap field in
TZ> SliceRange?  If you can specify the bitmap as an optional field it would
TZ> not break current clients.  Then the search can return only the subset
TZ> of the range that matches the bitmap.  This would make sense for
TZ> BytesType and LongType, at least.
>>
>> I looked at the source code and it seems that
>> StorageProxy::getSliceRange() is the focal point for reads and bitmap
>> matching should be implemented there.  The bitmap could be applied as a
>> filter before the other SliceRange parameters, especially the max number
>> of return results.  It may be worth the effort to send the bitmap down
>> to the ReadCommand/ColumnFamily level to reduce the number of potential
>> matches.
>>
>> If this is not feasible for technical reasons I'd like to know.
>> Otherwise I'll put it on my TODO list and produce a proposal (unless
>> someone more knowledgeable is interested, of course).

JE> how would this be different than the byte[] column name you can
JE> already match on?

Given byte columns

A 0110
B 0111
C 0101

the bitmask approach would let you specify a bitmask and get back only
the columns that match it.  It's just an AND that looks for a non-zero
value.  So you can say "0111" and get A, B, and C.  Or "0010" to get A
and B.  "1000" gets nothing.  A bitmask of "0011" would also get A, B,
and C under this test; it would get only B if a match required every
bit of the mask to be set, i.e. (value AND mask) == mask.

Cassandra could support OR-ed multiples for better queries, so you could
ask for (0001,0010) to get A, B, and C.

Ted
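
To make the matching rule concrete, here is a minimal Python sketch of the non-zero AND test and the OR-ed multiples, run against the three example columns.  The names `matches` and `filter_columns` are hypothetical and exist only for this illustration; nothing here is Cassandra code or a statement about how SliceRange would implement it.

```python
# Example columns from the message: name -> 4-bit value.
COLUMNS = {"A": 0b0110, "B": 0b0111, "C": 0b0101}


def matches(value, masks):
    """True if the value ANDed with at least one mask is non-zero."""
    return any(value & mask for mask in masks)


def filter_columns(columns, *masks):
    """Return the sorted names of columns matching any of the masks.

    Passing several masks gives the OR-ed behaviour: a column is kept
    if it matches at least one of them.
    """
    return sorted(name for name, value in columns.items()
                  if matches(value, masks))


print(filter_columns(COLUMNS, 0b0111))          # ['A', 'B', 'C']
print(filter_columns(COLUMNS, 0b0010))          # ['A', 'B']
print(filter_columns(COLUMNS, 0b1000))          # []
print(filter_columns(COLUMNS, 0b0001, 0b0010))  # ['A', 'B', 'C']
```

A real implementation would work on byte[] values rather than Python integers, but the per-column test would be the same AND applied byte by byte.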