I can't see us ever committing a dependency on a custom C++ library to the core, for the same reason that, despite passionate advocacy, we'll probably never have Scala code in the tree -- telling potential contributors that they need to be familiar with Java *and* Language X to contribute to the core just isn't worth the technical benefits.
Modern JVMs are surprisingly good at JITing bit manipulation, so the technical case for C++ might not be as strong as you'd initially think.

The query planner is dirt simple; it checks the statistics on the indexes available for a query, picks the one that matches the fewest rows, and does a nested loop on any other predicates. Look at ColumnFamilyStore.search to start.

(Thought I sent this but found it sitting in my Drafts folder. I blame the new gmail compose crap.)

On Mon, Apr 15, 2013 at 11:08 PM, Matt Stump <mrevilgn...@gmail.com> wrote:
> I spent some time this afternoon thinking about ways forward. I need to make progress regardless of whether or not my eventual work makes it into C*. In order to do so, I was thinking about creating an index management library and query engine in C++. Because of the nature of bitmap indexes it's OK if the engine on any node spits out partial results along with a list of the indexes it needs in order to complete the query. It would be up to the caller to get information to the index library, and to shuttle around partial bitmaps if necessary.
>
> I like this plan for a couple of reasons. I get tight control over memory and the performance of C++. I can wrap a C++ library and make it accessible to C* via JNA. If the work doesn't make it into C* I can just wrap it in riak_core and still meet my goals. A library fills an empty ecosystem niche between full-blown databases and collections.
>
> The unknown areas of risk in my mind are that I don't know enough about the implementation of indexes and the query planner in C* to know whether or not this is crazy. I've read through the old patches for the aborted bitmap index attempt and it looks pretty simple, but I lack context, which is hard to gather from old disembodied code. Are there any docs for the query planner or indexes other than the code and the bug DB?
>
> I spent some time benchmarking over the weekend, and my prototype is capable of answering 3-clause queries over fully materialised 4GB uncompressed indexes (4B encoded values) in ~240 ms on my laptop. I'm fairly certain I'm running into memory bandwidth limitations. There might be one or two small tweaks still available to me, but they're all OS- and hardware-dependent.
>
> I would love any feedback, even if it's to say "Matt, you're crazy, go home".
>
> Thanks,
> Matt Stump
>
> On Fri, Apr 12, 2013 at 10:41 AM, Matt Stump <mrevilgn...@gmail.com> wrote:
>
>> It looks like there is some interest, so I'm going to disgorge everything I've learned/considered in the past couple of weeks just so that we have a consistent base. I'm going to break down how the indexes work and their different optimizations and drawbacks, and try to address the points/questions that people have raised. I've broken it down by subject.
>>
>> Basic Equality:
>> Bitmap indexes are essentially giant arrays of bits, with one bit per possibility. If you want to know which rows have the boolean value "event1" set to true, then you set the bit at the address of each of those rows to 1 in the index. For example, the index 0100100101 means that rows 1, 4, 7 and 9 (counting from zero) contain "event1". If you want to know which rows contain event1 and event2, you do a bitwise AND over the two indexes to get the set intersection, e.g. 00100101 AND 10000101 results in 00000101. From this you can build complex queries by doing simple set arithmetic. Each of these sets is called a query dimension.
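>>
>> As a concrete illustration, here's a minimal Java sketch of that intersection using java.util.BitSet (purely illustrative; the demo class is hypothetical, and a real engine would operate on packed long[] words directly):
>>
>>     import java.util.BitSet;
>>
>>     public class EqualityIndexDemo {
>>         public static void main(String[] args) {
>>             // One bit per row: bit i is set when row i contains the event.
>>             BitSet event1 = new BitSet();  // 00100101 -> rows 2, 5, 7
>>             event1.set(2); event1.set(5); event1.set(7);
>>             BitSet event2 = new BitSet();  // 10000101 -> rows 0, 5, 7
>>             event2.set(0); event2.set(5); event2.set(7);
>>
>>             // Set intersection: rows containing event1 AND event2.
>>             BitSet both = (BitSet) event1.clone();
>>             both.and(event2);              // 00000101
>>             System.out.println(both);      // prints {5, 7}
>>         }
>>     }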
>> Range Queries:
>> If you want to encode ranges, such as "give me all users who have a counter in the integer interval [0, 2]", then you need a two-dimensional bitmap index. The first dimension records which values in [0, 8] have been hit: 100100011. The second dimension records which rows contain each of those possible values. So for value 0 there would be another index, 00100010, which means that rows 2 and 6 (again counting from zero) contain value 0. This forms a giant two-dimensional array:
>>
>> [1] [00100010]
>> [0] [00001010]
>> [0] [10100110]
>> [1] [00100010]
>> [0] [00101010]
>> [0] [01101010]
>> [0] [00111110]
>> [1] [00100110]
>> [1] [00100000]
>>
>> To figure out who has a counter value in [0, 2], you take the union of the sets [00100010], [00001010] and [10100110], which is [10101110].
>>
>> Binning:
>> Each index has an address size which limits the total number of rows/values you can encode in it. So with a 32-bit index you can encode a total of 2^32 (4,294,967,296) positions. If you need to encode more values than fit in that address space you can do one of two things: increase the address space, or perform binning. Binning is essentially deliberate hash collision, meaning two or more values are assigned to the same bit position. For example, if you wanted to index floats you could use the same index as above, but to know which rows contain real numbers in [0, 1.9] you would need to check the actual stored value for each row in the result set. This is called a candidate check, which is very expensive, often the most expensive part of a query.
>>
>> Segments:
>> If you increase the address size of the index then you run into space/memory problems. A 32-bit index is 512 MB when fully materialized (2^32 bits); a 64-bit index is 2 EB. To solve this problem you use segments/partitions, or compression. Segments work through the use of a sparse table or interval list. You break the index up into chunks of equal size, let's say 1,024 bits. If a bit gets flipped at position 4098, then you go to segment 0x04 (4098 = 4 * 1024 + 2) and, if that segment exists, you flip bit 2 within it. If the segment doesn't exist then you create it and set the bit. The advantage is that you only have to create segments that contain flipped bits. The downside is that if you have a wide distribution of flipped bits you end up with many segments holding only one or two set bits, and the rest is empty space or wasted memory.
>>
>> Compression:
>> The other approach is to use compression. The trick is that you need a compression algorithm that doesn't require decompression before doing the bitwise operations, which means you need a run-length encoding (RLE). There are 3 leading contenders for king of the hill: CONCISE, which is used by Druid; WAH, which is used by FastBit; and PWAH, which I don't think is used by anybody yet. They all work the same way, which is to store large blocks of zeros or ones as encoded run lengths. So you get an index like this one, taken from PWAH: [101 10000000001011 111010000100101 000000000010010], which means 11 blocks of one-bits, a literal block of 15 bits, and 18 blocks of zero-bits. The downside to this approach is that you lose the ability to index directly to a position. If you want to perform an update you've either got to read/decode down to that position and split the encoding, or rebuild the entire index. This is why Druid and FastBit are very good for static data sets but can't deal with fast-changing data.
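>>
>> As a rough illustration of the segment idea above, here's a minimal Java sketch of a sparse segmented bitmap (the names and sizes are hypothetical; a real implementation would add the full set algebra and compress cold segments):
>>
>>     import java.util.HashMap;
>>     import java.util.Map;
>>
>>     public class SegmentedBitmap {
>>         static final int SEGMENT_BITS = 1024;             // 1,024-bit chunks
>>         // Only segments that actually contain set bits get materialized.
>>         private final Map<Long, long[]> segments = new HashMap<Long, long[]>();
>>
>>         public void set(long position) {
>>             long segmentId = position / SEGMENT_BITS;      // bit 4098 -> segment 4
>>             int offset = (int) (position % SEGMENT_BITS);  // bit 2 of that segment
>>             long[] words = segments.get(segmentId);
>>             if (words == null) {                           // create on first write
>>                 words = new long[SEGMENT_BITS / 64];
>>                 segments.put(segmentId, words);
>>             }
>>             words[offset / 64] |= 1L << (offset % 64);
>>         }
>>
>>         public boolean get(long position) {
>>             long[] words = segments.get(position / SEGMENT_BITS);
>>             if (words == null) return false;               // absent segment == all zeros
>>             int offset = (int) (position % SEGMENT_BITS);
>>             return (words[offset / 64] & (1L << (offset % 64))) != 0;
>>         }
>>     }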
>> A hybrid approach to performance:
>> You can take a hybrid approach, which means combining segments and compression. You break the index address space up into segments and then perform compression within each of those segments. This mitigates the problem of many segments each having only one bit flipped while consuming an entire segment's worth of memory, and it also limits the cost of having to split the encoding of any compressed segment. As segments fill up you can compress and join them together using the standard SSTable or LevelDB approach.
>>
>> Distributing work across the ring:
>> If you use segments and a uniform index size, you can assign segments of the index to different portions of the C* ring. This means that one node would have all the segments necessary to answer queries involving that portion of the address space. If this also corresponds to the rows the node holds, it means the node can answer with certainty which rows in its address space match the query, and to get the complete query results you just merge the segments to obtain the final set.
>>
>> Other optimizations possibly worth considering:
>> If you know the size of each index beforehand, you can perform the bitwise operations across indexes with the indexes sorted by increasing size. If you know that the index for event1 has only one segment populated, then you only need to look at the corresponding segment in the other indexes, even if those indexes are fully materialized. There are also some other bitmap index types that encode ranges more efficiently, but I don't know enough about them yet to render an opinion.
>>
>> Other optimizations not worth considering:
>> There are some papers on limiting the IO cost of candidate checks through optimal binning strategies, but this requires modeling or having full knowledge of the distribution of your data, which isn't feasible.
>>
>> Query planning/limitations:
>> To be honest, I haven't dug into the existing query planner for CQL yet, so I can't speak to Jonathan's point about its complexity. What I can say is that the query logic for bitmap indexes is pretty simple: you break the query down into a set of boolean operations over the selected sets, and it can be executed in parallel by performing the same operations over adjacent segments. If you have knowledge about segment sizes then you can order the operations to limit the operation size (see the sketch below). To simplify things I would not allow indexing of CQL3 collections. My temptation would be to ship with 2D indexes, allowing support for range queries (also, I need them for my use case), but maybe that comes in another iteration. Trigram support, fuzzy matching, and regex just get layered on top of these fundamental concepts, and you can steal the hard bits from existing libraries/projects.
>>
>> Further research:
>> Some thought would need to be put into the appropriate segment size, index size, binning strategy, and hash selection for encoding non-numeric values.
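>>
>> To sketch how simple that planning can be, here's a hypothetical Java fragment that evaluates a conjunctive query by intersecting the predicate bitmaps smallest-first (illustrative only; a real engine would work segment by segment over compressed sets rather than on whole BitSets):
>>
>>     import java.util.ArrayList;
>>     import java.util.BitSet;
>>     import java.util.Collections;
>>     import java.util.Comparator;
>>     import java.util.List;
>>
>>     public class ConjunctionDemo {
>>         // AND the predicate bitmaps together, smallest cardinality first,
>>         // so the working set shrinks as quickly as possible.
>>         static BitSet evaluateAnd(List<BitSet> predicates) {
>>             List<BitSet> ordered = new ArrayList<BitSet>(predicates);
>>             Collections.sort(ordered, new Comparator<BitSet>() {
>>                 public int compare(BitSet a, BitSet b) {
>>                     return a.cardinality() - b.cardinality();
>>                 }
>>             });
>>             BitSet result = (BitSet) ordered.get(0).clone();  // assumes >= 1 predicate
>>             for (int i = 1; i < ordered.size() && !result.isEmpty(); i++) {
>>                 result.and(ordered.get(i));  // bail out once the intersection is empty
>>             }
>>             return result;
>>         }
>>     }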
>> That's basically it. I'm sure I've forgotten something. I'm really looking forward to any feedback, questions, or criticisms that you may have. I would really like to see something like this in C*, and I'm offering up my dedicated time to make it happen if I can make the goals/timeline of my employer and the project meet up. I've already done things such as putting the book I was writing on hold in order to make time. I understand if you say no; you do have the greater project to consider. If you do say no then we will most likely continue with this project as a separate query/index engine, and C* will just act as the storage layer. If we go it alone, the resulting code may or may not be open source; that hasn't been decided yet.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced