Sorry for my slow response. I wanted to wait until I had time to write a full reply, rather than saying I would and then forgetting again.
Let me tell you about a use case of mine that doesn't work in the current block cache implementation and that sort of sparked the v2 work. The cluster I run with Blur has a lot of servers, and the servers have a lot of RAM. We of course use the block cache to make searching faster, but we also use filters for a lot of the searches. We run a custom filter cache (extends o.a.b.manager.BlurFilterCache) that uses an off-heap bitset filter. It is off heap for the same reason the block cache is off heap: to reduce GC times.

The problem is that the bitset filter sizes change with the number of records (Lucene documents), and we regularly add large amounts of records/rows. The block cache is a fixed size, so if I allow for 80GB of off-heap memory, I might give 75GB to the block cache and leave 5GB for the bitset filters. That means I have to manage how many bitsets, by size, can reside in that 5GB with an ever-increasing number of records/rows. While this is possible, I was thinking it would be nice to just have the block cache handle it. Plus, I could write the bitset to the directory as a file (one per segment) and only reload it when needed (after a restart, shard movement, etc.).

The problem is that the current block cache uses a fixed block size (8K by default). That means each file is logically broken up into 8K blocks, which are loaded into and evicted from the block cache in that fixed size. While this seems to work well for the random access nature of Lucene searches, it does not work well for sequentially sweeping through the bitset file (or any other file, for that matter). In all of my micro benchmarks, a "next" call on a bitset with 10% cardinality, or having to "advance" in the bitset, was over 400% slower than a normal OpenBitSet implementation. I tried adding links in the value object in the LRU to the next block to create a linked-list effect, but that had almost no effect; it was still very slow. Plus, I want to be able to support sorting at some point, and in Lucene 4.+ there is a file-based sort that would fit very nicely in the block cache as well, if the access pattern worked out.

So, in short, we need to be able to support variable-length block sizes based on the file or usage pattern. For some files maybe 4K is better, perhaps with smaller segments? For other files, maybe 256K block sizes. For files like the bitset files, or the files used for sorting, map the whole file in as a single block (or up to some max size of 100s of MB?).

I think it would be really awesome to run a tuning program against an HDFS cluster to see what block size is best. The program could create large files in the GB range and perform random accesses on them with variable-length buffers ranging from 1K to 512K, and see what size provides the most throughput at the lowest latency. We could then use that size to configure the block cache to be most effective. (A rough sketch of what such a program might look like is below.)

So the design that's a work in progress in the v2 code uses the same ConcurrentLinkedHashMap (LRU). However, this time around there are no slabs of memory, just the key and value in the LRU. The LRU is now weighted ( https://code.google.com/p/concurrentlinkedhashmap/wiki/Design -> Weighted Values ) so that each block counts for its own size. This means we can now tell the LRU that it has an 80GB capacity and that each 8K block has a weight of 8192. That makes configuration a lot easier, and it also gives us variable-length block sizes within the same LRU. The biggest issue with this design is that each value in the LRU (the block data) is its own memory allocation.
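
To make the weighting concrete, here is a minimal sketch of how the ConcurrentLinkedHashMap builder supports this. It compiles against the CLHM library linked above; the String keys and raw byte[] values are stand-ins for illustration, not the actual v2 classes:

import com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap;
import com.googlecode.concurrentlinkedhashmap.Weigher;
import java.util.concurrent.ConcurrentMap;

public class WeightedCacheSketch {

  public static void main(String[] args) {
    // The capacity is a total weight (here: bytes, 80GB). Each entry
    // weighs its own length, so an 8K block costs 8192 while a
    // whole-file bitset block costs its full length -- mixed block
    // sizes share one LRU budget.
    ConcurrentMap<String, byte[]> cache =
        new ConcurrentLinkedHashMap.Builder<String, byte[]>()
            .maximumWeightedCapacity(80L * 1024 * 1024 * 1024)
            .weigher(new Weigher<byte[]>() {
              @Override
              public int weightOf(byte[] block) {
                return block.length;
              }
            })
            .build();

    cache.put("segment_1.fdt/block-0", new byte[8192]);        // normal 8K block
    cache.put("segment_1.filter", new byte[64 * 1024 * 1024]); // whole-file block
  }
}

Evicting one large entry frees the same weight as thousands of small ones, which is exactly the behavior we want for the bitset files.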
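
And here is the rough shape of the tuning program I have in mind. This is only a sketch: it reads a local file through RandomAccessFile for simplicity, where a real run would go through the HDFS client, and the read count is arbitrary:

import java.io.RandomAccessFile;
import java.util.Random;

public class BlockSizeTuner {

  public static void main(String[] args) throws Exception {
    String path = args[0]; // a pre-created test file in the GB range
    int reads = 10000;
    Random random = new Random();
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      long length = file.length();
      // Sweep buffer sizes from 1K to 512K, doubling each time, and
      // report throughput and per-read latency for each size.
      for (int blockSize = 1024; blockSize <= 512 * 1024; blockSize *= 2) {
        byte[] buffer = new byte[blockSize];
        long start = System.nanoTime();
        for (int i = 0; i < reads; i++) {
          file.seek((long) (random.nextDouble() * (length - blockSize)));
          file.readFully(buffer);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double mbPerSecond = (reads * (double) blockSize) / (1 << 20) / seconds;
        System.out.printf("block=%6dK  %8.1f MB/s  %6.3f ms/read%n",
            blockSize / 1024, mbPerSecond, seconds * 1000.0 / reads);
      }
    }
  }
}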
I started out trying to use a ByteBuffer like in the current design, but as the number of entries increased, the performance overhead was too great. The problem is that the ByteBuffer has a couple of objects internal to it, one of which is a Cleaner object used to deallocate memory from the process as the ByteBuffer is GCed. With millions of these objects hanging around on the heap, GC was a real issue. Also, the ByteBuffer allocates one extra page of memory beyond what is asked for, so for an 8K block and the normal page size of 4K, that's 50% waste in every entry. Not good.

So, for ease of testing, the value in the LRU is an interface with some basic methods: read, write, size, etc. I did this so that I could test an on-heap implementation that is byte[] based and an off-heap one that uses Unsafe for direct memory access. (I have put rough sketches of both at the bottom of this mail.)

One interesting performance increase with the v2 approach is that there is no memory copying during a cache hit. In the current implementation (v1), when a hit is made, the block is copied from the off-heap memory to the local buffer in the IndexInput. This is very expensive when all you need to read is a single byte, or an int, etc. So in v2 the value is actually referenced by the IndexInput as its current buffer. It can do this safely by implementing reference counting and a deallocation method (only used in the Unsafe impl).

The next question is the state of the committed code. All the tests pass, but a large part of the code is disabled because of a bug I haven't tracked down yet. https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-store/src/test/java/org/apache/blur/store/CacheDirectoryTestSuite.java;h=2085bd0b044470b4e1122c5949e634429c58012c;hb=apache-blur-0.2#l55 Line 55 needs to return true and all the tests need to pass. Once they do, I will perform a burn-in test on a cluster to validate that there are no issues. After that I would like to move to make it the default implementation.

I hope this helps explain the reasons behind the new block cache and how it's being designed. Sorry for not posting about it earlier; I got distracted with the release of 0.2.0.

Aaron

On Tue, Oct 1, 2013 at 12:24 PM, rahul challapalli <[email protected]> wrote:

> Hi Aaron,
>
> What is the status of the new implementation? I couldn't find any
> discussion on this new implementation and its design. If I missed a
> particular email thread which discussed the goals of V2 can you please
> point me to that thread, if not can you elaborate your thoughts, goals and
> what needs to be done? Thank You.
>
> - Rahul
>
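P.S. Here are the sketches of the value abstraction I mentioned above. The names (CacheValue, ByteArrayCacheValue, UnsafeCacheValue) are illustrative only, not the actual classes in the tree, and the real interface has a few more methods. First, the interface and the on-heap byte[] implementation:

// Illustrative only -- the shape of the abstraction, not the Blur code.
public interface CacheValue {
  int size();
  byte read(int position);
  void read(int position, byte[] buf, int offset, int length);
  void write(int position, byte[] buf, int offset, int length);
  void incRef();   // called by an IndexInput that uses this value as its buffer
  void release();  // decrement the ref count; the Unsafe impl frees at zero
}

// On-heap implementation backed by a plain byte[]. Nothing to deallocate
// explicitly; the GC reclaims the array, so the ref-count methods are no-ops.
public class ByteArrayCacheValue implements CacheValue {

  private final byte[] data;

  public ByteArrayCacheValue(int size) {
    data = new byte[size];
  }
  public int size() {
    return data.length;
  }
  public byte read(int position) {
    return data[position];
  }
  public void read(int position, byte[] buf, int offset, int length) {
    System.arraycopy(data, position, buf, offset, length);
  }
  public void write(int position, byte[] buf, int offset, int length) {
    System.arraycopy(buf, offset, data, position, length);
  }
  public void incRef() { }
  public void release() { }
}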
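And the off-heap implementation, showing the reference counting that makes the zero-copy cache hit safe. Again just a sketch; obtaining sun.misc.Unsafe reflectively like this is JDK-specific:

import java.lang.reflect.Field;
import java.util.concurrent.atomic.AtomicInteger;
import sun.misc.Unsafe;

public class UnsafeCacheValue implements CacheValue {

  private static final Unsafe UNSAFE;
  static {
    try {
      Field field = Unsafe.class.getDeclaredField("theUnsafe");
      field.setAccessible(true);
      UNSAFE = (Unsafe) field.get(null);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  private final long address;
  private final int size;
  // Starts at 1 for the cache's own reference. Each IndexInput holding this
  // value as its current buffer brackets its use with incRef()/release(), so
  // eviction can't free the memory out from under a reader.
  private final AtomicInteger refs = new AtomicInteger(1);

  public UnsafeCacheValue(int size) {
    this.size = size;
    this.address = UNSAFE.allocateMemory(size);
  }
  public int size() {
    return size;
  }
  public byte read(int position) {
    return UNSAFE.getByte(address + position);
  }
  public void read(int position, byte[] buf, int offset, int length) {
    UNSAFE.copyMemory(null, address + position,
        buf, Unsafe.ARRAY_BYTE_BASE_OFFSET + offset, length);
  }
  public void write(int position, byte[] buf, int offset, int length) {
    UNSAFE.copyMemory(buf, Unsafe.ARRAY_BYTE_BASE_OFFSET + offset,
        null, address + position, length);
  }
  public void incRef() {
    refs.incrementAndGet();
  }
  public void release() {
    // The last release -- whether from a reader or the evictor -- frees it.
    if (refs.decrementAndGet() == 0) {
      UNSAFE.freeMemory(address);
    }
  }
}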
