I can't see us ever committing a dependency to a custom C++ library to
the core, for the same reason that despite passionate advocacy we'll
probably never have Scala code in the tree -- saying that potential
contributors need to be familiar with Java *and* Language X to
contribute to the core is just not worth the technical benefits.

Modern JVMs are surprisingly good at JITing bit manipulation, so the
technical case for C++ might not be as strong as you'd initially
think.

The query planner is dirt simple; it checks the statistics on the
indexes available for a query, picks the one that matches the fewest
rows, and does a nested loop on any other predicates.  Look at
ColumnFamilyStore.search to start.

(Thought I sent this but found it sitting in my Drafts folder.  I
blame the new gmail compose crap.)

On Mon, Apr 15, 2013 at 11:08 PM, Matt Stump <mrevilgn...@gmail.com> wrote:
> I spent some time this afternoon thinking about ways forward. I need to
> make progress regardless of whether or not my eventual work makes it into
> C*. In order to do so, I was thinking about creating an index management
> library and query engine in C++. Because of the nature of bitmap indexes
> it's ok if the engine on any node spits out partial results with a list of
> indexes it needs in order to complete the query. It would be up to the
> caller to get information to the index library, and shuttle around partial
> bitmaps if necessary.
>
> I like this plan for a couple reasons. I get tight the tight control over
> memory and the performance of C++. I can wrap a C++ library and make it
> accessible to C* via JNA. If the work doesn't make it into C* I can just
> wrap it in riak_core and still meet my goals. A library fills an empty
> ecosystem niche between full blown databases and collections.
>
> Unknowns areas of risk in my mind are that I don't know enough about the
> implementation of indexes and the query planner in C* to know whether or
> not this is crazy. I've read through the old patches for the aborted bitmap
> index attempt and it looks pretty simple, but I lack context which is hard
> to gather from old disembodied code. Are there any docs for the query
> planner or indexes other than code and the bug DB?
>
> I spent some time benchmarking over the weekend and my prototype is capable
> of answering 3 clause queries for fully materialsed 4GB uncompressed
> indexes (4B encoded values) in ~240 ms on my laptop. I'm fairly certain I'm
> running into memory bandwidth limitations. There might be one or two small
> tweaks still available to me, but they're all OS and hardware dependent.
>
> I would love any feedback, even if it's to say "Matt, you're crazy go home".
>
> Thanks,
> Matt Stump
>
>
>
> On Fri, Apr 12, 2013 at 10:41 AM, Matt Stump <mrevilgn...@gmail.com> wrote:
>
>> It looks like there is some interest so I'm going to disgorge everything
>> I've learned/considered in the past couple weeks just so that we have a
>> consistant base. I'm going to break down how the indexes work, different
>> optimizations and drawbacks and try to address the points/questions that
>> people have raised. I've broken it down by subject.
>>
>> Basic Equality:
>> Bitmap indexes are essentially giant arrays of bytes, with one bit per
>> possibility. If you want to know what rows have boolean value "event1" set
>> to true then you set the address of those rows to 1 in the index. For
>> example in index 0100100101, this would mean rows 1, 4, 7 and 9 would
>> contain "event1". If you want to know which rows contain event1 and event2
>> then you do a bitwise AND over the two indexes to get the set intersection
>> eg. 00100101 AND 10000101 results in 00000101. From this you can build
>> complex queries by doing simple set arithmetic. Each of these sets are
>> called a query dimension.
>>
>> Range Queries:
>> If you want to encode ranges such as give me all users who have a counter
>> in the integer interval [0, 2] then you need a two dimensional bitmap
>> index. The first dimension is what values between [0, 7] have been hit:
>> 10010011, the second dimension is which rows for each of those possible
>> values contain the value. So for value 0 there would be another index
>> 00100010, which means that rows 3 and 7 contain value 0.  This forms a
>> giant two dimensional array.
>>
>> [1] [00100010]
>> [0] [00001010]
>> [0] [10100110]
>> [1] [00100010]
>> [0] [00101010]
>> [0] [01101010]
>> [0] [00111110]
>> [1] [00100110]
>> [1] [00100000]
>>
>> To figure out the answer to who has counter value [0, 2] would be the
>> union of the sets  [00100010], [00001010], [10100110] which is [10101110].
>>
>> Binning:
>> Each index has an address size which limits the total number of
>> rows/values you can encode in an index. So if you have a 32bit index out
>> can encode a total of 4,294,967,295 positions. If you need to encode more
>> values than what is possible in that address space you can do two things.
>> Increase the address space or perform binning. Binning is essentially
>> hash collision, meaning two or more values are assigned to the same value.
>> For example if you wanted to index floats you could use the same index as
>> above but if you want to know the rows who contain the real numbers [0,1.9]
>> then you would need to check the actual value for the rows in the result
>> set. This is called a candidate check which is very expensive, often the
>> most expensive part of a query.
>>
>> Segments:
>> If you increase the address size of the index then you run into
>> space/memory problems. A 32 bit index is 4 GB when fully materialized, a 64
>> bit index is 2 PB. To solve this problem you use segments/partitions or
>> compression. Segments work through the use of a sparse table
>> or interval list. You break the index up into chunks of equal size lets say
>> 1kb. If an bit gets flipped at position 4098, then you go to segment 0x02
>> and if it exists you flip that bit. If that segment doesn't exist then you
>> create it and set the bit. The advantage is that you only have to create
>> segments that contain flipped bits. The downside is that if you have a wide
>> distribution of bits flipped you end up with many segments with one or two
>> bits flipped and the is empty space or wasted memory.
>>
>> Compression:
>> The other approach to use compression. The trick is that you need a
>> compression algorithm that doesn't require decompression before doing the
>> bitwise operations. This means you need a run length encoding (RLE). There
>> are 3 leading contenders for king of the hill, CONCISE which is used by
>> Druid, WAH which is used by Fastbit, and PWAH which isn't used by anybody I
>> think yet. They all work the same way which is to encode store large blocks
>> of zeros or ones as encoded values.  So you get this index taken from PWAH
>> [101 10000000001011 111010000100101 000000000010010] which means 11
>> blocks of 1 bits a literal block of 15 bits and 18 blocks of 0 bits. The
>> downside to this approach is that you lose the ability to index directly to
>> a position in the index. If you want to perform an update you've either got
>> to read/decode to that position and split the index or rebuild the entire
>> index. This is why Druid, and Fastbit are very good for static data sets
>> but can't deal with fast changing data.
>>
>> A hybrid approach to performance:
>> You can take a hybrid approach to performance which means combine segments
>> and compression. You break the index address space up into segments and
>> then perform compression within each of those segments to negate the
>> problem of many segments with only 1 bit flipped but consuming the entire
>> segment size worth of memory. This also limits the cost of having to split
>> the encoding for any compressed segment. As segments fill up you can
>> compress and join them together using the standard SSTable or levelDB
>> approach.
>>
>> Distributing work across the ring:
>> If you use segments and a uniform index size that means you can assign
>> segments of the index to different portions of the C* ring. This means that
>> one node would have all the segments necessary to answer queries involving
>> that portion of the address space. If this also corresponds to the rows it
>> has, It means it could answer with certainty that rows in it's address
>> space match the query, and to get the entire query results you just merge
>> the segments to obtain the final set.
>>
>> Other optimizations possibly worth considering:
>> If you know the size of each index beforehand you can perform the bitwise
>> operations across indexes with the indexes sorted by increasing size.  If
>> you know that the index for event1 only has 1 segment populated that means
>> you only need to look at that corresponding segment for the other indexes
>> even if those indexes are a fully materialized. There are also some other
>> bitmap index types that more efficiently encode range but I don't know
>> enough about them yet to render an opinion.
>>
>> Other optimizations not worth considering:
>> There are some other papers on limiting the IO cost of candidate checks
>> through optimal binning strategies, but this requires modeling or having
>> full knowledge of the distribution of your data which isn't feasible.
>>
>> Query planning/limitations:
>> To be honest I haven't dug into the existing query planner for CQL yet so
>> I can't answer to Jonathan's point about it's complexity. What I can say is
>> that the query logic for bitmap indexes is pretty simple. You just need to
>> break it down into a set of boolean operations over the selected sets, and
>> it can be performed in parallel by performing the same operations
>> over adjacent segments. If you have knowledge about about segment size then
>> you can perform the operations in order to limit the operation size. To
>> simply things I would not allow indexing of CQL3 collections. My temptation
>> would be to ship with 2D indexes allowing support for range queries (also I
>> need it for my use case), but maybe that comes in another iteration.
>> Trigram support, fuzzy matching and regex just get layered on top of these
>> fundamental concepts, and you can steal the hard bits from existing
>> libraries/projects.
>>
>> Further research:
>> Some thought would need to be put into what are the appropriate segment
>> size, index size, binning strategy, and hash selection for encoding
>> non-numeric values.
>>
>> That's basically it. I'm sure I've forgotten something. I'm really looking
>> forward to any feedback, questions, or criticisms that you may have. I
>> would really like to see something like this in C*, and I'm offering up my
>> dedicated time to make it happen if I can make the goals/timeline of
>> my employer and the project meet up. I've already done things such as put
>> the book I was writing on hold in order to make time. I understand if you
>> say no, you do have the greater project to consider. If you do say no then
>> we will most likely continue with this project as a separate query/index
>> engine and C* will just act as the storage layer. If we go alone the
>> resulting code may or may not be open source; it hasn't been decided yet.
>>
>>
>>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced

Reply via email to