Thanks Aaron. I really appreciate your detailed response. I will go through the code sometime this weekend and see if there is something I can contribute.
- Rahul

On Tue, Oct 1, 2013 at 5:50 PM, Aaron McCurry <[email protected]> wrote:

> Sorry for my slow response. I wanted to wait until I had time to write a
> full response (and not say that I would and then forget again).
>
> Let me tell you about my use case that doesn't work in the current block
> cache implementation and that sort of sparked the v2.
>
> The cluster that I run with Blur has a lot of servers, and the servers have
> a lot of RAM. We of course use the block cache to make searching faster,
> but we also use filters for a lot of the searches. We run a custom
> filter cache (extends o.a.b.manager.BlurFilterCache) that uses an off-heap
> bitset filter. It is off heap for the same reason the block cache is
> off heap: to reduce GC times. The problem is that the bitset filter sizes
> change with the number of records (Lucene documents), and we regularly add
> large amounts of records/rows. The block cache is a fixed size, so if I
> allow for 80GB of off-heap memory, 75GB goes to the block cache and 5GB is
> left for the bitset filters. That means I have to manage how many bitsets,
> by size, can reside in that 5GB with an ever-increasing number of
> records/rows. While this is possible, I was thinking it would be nice
> to just have the block cache handle it. Plus I could write the bitsets to
> the directory as files (one per segment) and only reload them when needed
> (after a restart, shard movement, etc.).
>
> The problem is that the current block cache uses a fixed block size (8K by
> default). That means each file is logically broken up into blocks of
> 8K, which are loaded and unloaded from the block cache in those fixed
> sizes. While this seems to work well for the random-access nature of
> Lucene searches, it does not work well for sequentially sweeping through
> the bitset file (or any other file, for that matter).
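[Editor's note: the fixed-block scheme Aaron describes maps a logical file position to a block id plus an offset within that block. A minimal sketch of that addressing, with illustrative names (this is not Blur's actual code):]

```java
// Illustrative sketch of fixed-size block addressing, as used by a
// block cache with an 8K block size. Names are hypothetical.
public class BlockAddress {
    static final int BLOCK_SHIFT = 13;               // 2^13 = 8192 bytes
    static final int BLOCK_SIZE  = 1 << BLOCK_SHIFT;
    static final int BLOCK_MASK  = BLOCK_SIZE - 1;

    // Which 8K block of the file a byte position falls in.
    static long blockId(long filePosition) {
        return filePosition >>> BLOCK_SHIFT;
    }

    // Offset of that byte within its block.
    static int blockOffset(long filePosition) {
        return (int) (filePosition & BLOCK_MASK);
    }
}
```

[A sequential sweep over a large file visits every block exactly once in order, so an LRU keyed on (file, blockId) gets no reuse out of it, which is consistent with the slowdown described below.]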
> In all of my micro-benchmarks, a "next" call on a bitset with 10%
> cardinality, or having to "advance" in the bitset, was over 400% slower
> than a normal OpenBitSet implementation. I tried adding links in the value
> object in the LRU to the next block, to create a linked-list effect, but
> that had almost no effect. Still very slow.
>
> Plus I want to be able to support sorting at some point, and in Lucene 4.x
> there is a file-based sort that would fit very nicely in the block cache as
> well, if the access pattern worked out.
>
> So, in short, we need to be able to support variable-length block sizes
> based on the file or usage pattern. For some files maybe 4K is better,
> maybe with smaller segments? For other files, maybe 256K block sizes. For
> files like the bitset files, or the files used for sorting, map the whole
> file in as a single block (or up to some max size of 100s of MB?). I think
> it would be really awesome to run a tuning program against an HDFS cluster
> to see which block size is best. The program could create a large file in
> the GB range and perform random accesses on it with variable-length buffers
> ranging from 1K to 512K, and see what provides the most throughput at the
> lowest latency. We could then use that size to configure the block cache to
> be most effective.
>
> So the design that's a work in progress in the v2 code uses the same
> ConcurrentLinkedHashMap (LRU). However, this time around there are no slabs
> of memory, just the key and value in the LRU. The LRU is now weighted
> (https://code.google.com/p/concurrentlinkedhashmap/wiki/Design -> Weighted
> Values) so that each block counts for its own size. This means we can
> now tell the LRU that it has an 80GB capacity and that each 8K block has a
> weight of 8192. This makes configuration a lot easier, and it also gives us
> variable-length block sizes within the same LRU.
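[Editor's note: the weighted-LRU idea can be shown with a small single-threaded sketch. The real v2 cache uses ConcurrentLinkedHashMap's weighted values for concurrent access; this simplified, hypothetical version only illustrates how weighting by block size lets 8K and 256K blocks share one capacity budget:]

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Single-threaded sketch of a weighted LRU. Eviction is driven by total
// byte weight, not entry count, so blocks of different sizes coexist.
public class WeightedLruSketch {
    private final long maxWeight;                   // e.g. 80GB in a real cache
    private long currentWeight = 0;
    private final LinkedHashMap<String, byte[]> map =
            new LinkedHashMap<>(16, 0.75f, true);   // access-order = LRU order

    public WeightedLruSketch(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    public void put(String key, byte[] block) {
        byte[] old = map.put(key, block);
        if (old != null) currentWeight -= old.length;
        currentWeight += block.length;              // weight = block size
        evict();
    }

    public byte[] get(String key) {
        return map.get(key);                        // moves entry to MRU position
    }

    private void evict() {
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            currentWeight -= eldest.getValue().length;
            it.remove();                            // drop least-recently-used
        }
    }

    public long weight() { return currentWeight; }
    public int size() { return map.size(); }
}
```

[With a 16K budget, inserting three 8K blocks evicts the least-recently-used one; a single 16K block and two 8K blocks would be interchangeable from the cache's point of view.]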
> The biggest issue with this design is that each value in the LRU (the block
> data) is its own memory allocation. I started out trying to use a ByteBuffer,
> like in the current design, but as the number of entries increased the
> performance overhead became too great. The problem is that the ByteBuffer
> has a couple of objects internal to it, one of which is a Cleaner object
> used to deallocate memory from the process as the ByteBuffer is GCed.
> With millions of these objects hanging around on the heap, GC was a real
> issue. Also, the ByteBuffer allocates one extra page of memory beyond what
> is asked for, so for an 8K block and the normal page size of 4K, that's 50%
> waste in every entry. Not good.
>
> So, for ease of testing, the value in the LRU is an interface with some
> basic methods: read, write, size, etc. I did this so that I could test an
> on-heap implementation that is byte[]-based and an off-heap one that uses
> Unsafe for direct memory access. One interesting performance increase with
> the v2 approach is that there is no memory copying during a cache hit. In
> the current implementation (v1), when a hit is made, the block is copied
> from the off-heap memory to the local buffer in the IndexInput. This is
> very expensive when all you need to read is a single byte, or an int, etc.
> So in v2 the value is actually referenced by the IndexInput as its current
> buffer. It can do this safely by implementing reference counting and a
> deallocation method (only used in the Unsafe impl).
>
> The next question is: what is the state of the committed code? All the
> tests pass, but a large part of the code is disabled because of a bug I
> haven't tracked down yet.
>
> https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-store/src/test/java/org/apache/blur/store/CacheDirectoryTestSuite.java;h=2085bd0b044470b4e1122c5949e634429c58012c;hb=apache-blur-0.2#l55
>
> Line 55 needs to return true and all the tests need to pass.
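[Editor's note: the zero-copy-on-hit scheme works because a value stays alive while any reader holds a reference to it. A hypothetical sketch of such a reference-counted value, using the on-heap byte[] flavor (the interface and names below are illustrative, not Blur's actual API; the Unsafe variant would free its native memory where this one merely marks itself freed):]

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a reference-counted cache value. The cache holds one reference;
// each IndexInput using the block as its current buffer holds another.
interface CacheValue {
    byte read(long position);
    int size();
    void incRef();
    void decRef();   // deallocates when the count reaches zero
}

class OnHeapCacheValue implements CacheValue {
    private final byte[] data;
    private final AtomicInteger refs = new AtomicInteger(1); // cache's reference
    private volatile boolean freed = false;

    OnHeapCacheValue(byte[] data) { this.data = data; }

    @Override public byte read(long position) {
        // No copy on a cache hit: the reader reads straight from this buffer.
        return data[(int) position];
    }

    @Override public int size() { return data.length; }

    @Override public void incRef() { refs.incrementAndGet(); }

    @Override public void decRef() {
        if (refs.decrementAndGet() == 0) {
            freed = true;   // an Unsafe impl would release native memory here
        }
    }

    boolean isFreed() { return freed; }
}
```

[On eviction the cache drops its reference, but the memory outlives the eviction until the last reader releases it, which is what makes handing the raw buffer to the IndexInput safe.]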
> Once they do, I will perform a burn-in test on a cluster to validate that
> there are no issues. After that, I would like to move to make it the
> default implementation.
>
> I hope this helps explain the reasons behind the new block cache and how
> it has been designed. Sorry for not posting about it earlier; I got
> distracted with the release of 0.2.0.
>
> Aaron
>
> On Tue, Oct 1, 2013 at 12:24 PM, rahul challapalli <
> [email protected]> wrote:
>
> > Hi Aaron,
> >
> > What is the status of the new implementation? I couldn't find any
> > discussion on this new implementation and its design. If I missed a
> > particular email thread which discussed the goals of v2, can you please
> > point me to that thread? If not, can you elaborate on your thoughts,
> > goals, and what needs to be done? Thank you.
> >
> > - Rahul
