Thanks Aaron. I really appreciate your detailed response. I will go through the code sometime this weekend and see if there is something I can contribute.
- Rahul

On Tue, Oct 1, 2013 at 5:50 PM, Aaron McCurry <[email protected]> wrote:

> Sorry for my slow response. I wanted to wait until I had time to write a
> full response (and not say that I would and then forget again).
>
> Let me tell you about my use case that doesn't work in the current block
> cache implementation and that sort of sparked the v2.
>
> The cluster that I run with Blur has a lot of servers, and the servers have
> a lot of RAM. We of course use the block cache to make searching faster,
> but we also use filters for a lot of the searches. We run a custom
> filter cache (extends o.a.b.manager.BlurFilterCache) that uses an off-heap
> bitset filter. It is off heap for the same reason the block cache is
> off heap: to reduce GC times. The problem is that the bitset filter sizes
> change with the number of records (Lucene documents), and we regularly add
> large amounts of records/rows. The block cache is a fixed size, so if I
> allow for 80GB of off-heap memory, 75GB goes to the block cache and 5GB is
> left for the bitset filters. That means I have to manage how many bitsets,
> by size, can reside in that 5GB with an ever-increasing number of
> records/rows. While this is possible, I was thinking it would be nice
> to just have the block cache handle it. Plus I could write the bitsets to
> the directory as files (one per segment) and only reload them when needed
> (after a restart, shard movement, etc.).
>
> The problem is that the current block cache uses a fixed block size (8K by
> default). That means each file is logically broken up into blocks of
> 8K, which are loaded and unloaded from the block cache in those fixed
> sizes. While this seems to work well for the random-access nature of
> Lucene searches, it does not work well for sequentially sweeping through
> the bitset file (or any other file, for that matter).
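[Editor's note: the fixed-block scheme Aaron describes maps a logical file position to a block id plus an offset within that block. A minimal sketch of that addressing, with illustrative names (this is not Blur's actual code):]

```java
// Illustrative sketch of fixed-size block addressing, as used by a
// block cache with an 8K block size. Names are hypothetical.
public class BlockAddress {
    static final int BLOCK_SHIFT = 13;               // 2^13 = 8192 bytes
    static final int BLOCK_SIZE  = 1 << BLOCK_SHIFT;
    static final int BLOCK_MASK  = BLOCK_SIZE - 1;

    // Which 8K block of the file a byte position falls in.
    static long blockId(long filePosition) {
        return filePosition >>> BLOCK_SHIFT;
    }

    // Offset of that byte within its block.
    static int blockOffset(long filePosition) {
        return (int) (filePosition & BLOCK_MASK);
    }
}
```

[A sequential sweep over a large file visits every block exactly once in order, so an LRU keyed on (file, blockId) gets no reuse out of it, which is consistent with the slowdown described below.]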
> In all of my micro-benchmarks, a "next" call on a bitset with 10%
> cardinality, or having to "advance" in the bitset, was over 400% slower
> than a normal OpenBitSet implementation. I tried adding links in the value
> object in the LRU to the next block, to create a linked-list effect, but
> that had almost no effect. Still very slow.
>
> Plus I want to be able to support sorting at some point, and in Lucene 4.x
> there is a file-based sort that would fit very nicely in the block cache as
> well, if the access pattern worked out.
>
> So, in short, we need to be able to support variable-length block sizes
> based on the file or usage pattern. For some files maybe 4K is better,
> maybe with smaller segments? For other files, maybe 256K block sizes. For
> files like the bitset files, or the files used for sorting, map the whole
> file in as a single block (or up to some max size of 100s of MB?). I think
> it would be really awesome to run a tuning program against an HDFS cluster
> to see which block size is best. The program could create a large file in
> the GB range and perform random accesses on it with variable-length buffers
> ranging from 1K to 512K, and see what provides the most throughput at the
> lowest latency. We could then use that size to configure the block cache to
> be most effective.
>
> So the design that's a work in progress in the v2 code uses the same
> ConcurrentLinkedHashMap (LRU). However, this time around there are no slabs
> of memory, just the key and value in the LRU. The LRU is now weighted
> (https://code.google.com/p/concurrentlinkedhashmap/wiki/Design -> Weighted
> Values) so that each block counts for its own size. This means we can
> now tell the LRU that it has an 80GB capacity and that each 8K block has a
> weight of 8192. This makes configuration a lot easier, and it also gives us
> variable-length block sizes within the same LRU.
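[Editor's note: the weighted-LRU idea can be shown with a small single-threaded sketch. The real v2 cache uses ConcurrentLinkedHashMap's weighted values for concurrent access; this simplified, hypothetical version only illustrates how weighting by block size lets 8K and 256K blocks share one capacity budget:]

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Single-threaded sketch of a weighted LRU. Eviction is driven by total
// byte weight, not entry count, so blocks of different sizes coexist.
public class WeightedLruSketch {
    private final long maxWeight;                   // e.g. 80GB in a real cache
    private long currentWeight = 0;
    private final LinkedHashMap<String, byte[]> map =
            new LinkedHashMap<>(16, 0.75f, true);   // access-order = LRU order

    public WeightedLruSketch(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    public void put(String key, byte[] block) {
        byte[] old = map.put(key, block);
        if (old != null) currentWeight -= old.length;
        currentWeight += block.length;              // weight = block size
        evict();
    }

    public byte[] get(String key) {
        return map.get(key);                        // moves entry to MRU position
    }

    private void evict() {
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            currentWeight -= eldest.getValue().length;
            it.remove();                            // drop least-recently-used
        }
    }

    public long weight() { return currentWeight; }
    public int size() { return map.size(); }
}
```

[With a 16K budget, inserting three 8K blocks evicts the least-recently-used one; a single 16K block and two 8K blocks would be interchangeable from the cache's point of view.]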
> The biggest issue with this design is that each value in the LRU (the block
> data) is its own memory allocation. I started out trying to use a ByteBuffer,
> like in the current design, but as the number of entries increased the
> performance overhead became too great. The problem is that the ByteBuffer
> has a couple of objects internal to it, one of which is a Cleaner object
> used to deallocate memory from the process as the ByteBuffer is GCed.
> With millions of these objects hanging around on the heap, GC was a real
> issue. Also, the ByteBuffer allocates one extra page of memory beyond what
> is asked for, so for an 8K block and the normal page size of 4K, that's 50%
> waste in every entry. Not good.
>
> So, for ease of testing, the value in the LRU is an interface with some
> basic methods: read, write, size, etc. I did this so that I could test an
> on-heap implementation that is byte[]-based and an off-heap one that uses
> Unsafe for direct memory access. One interesting performance increase with
> the v2 approach is that there is no memory copying during a cache hit. In
> the current implementation (v1), when a hit is made, the block is copied
> from the off-heap memory to the local buffer in the IndexInput. This is
> very expensive when all you need to read is a single byte, or an int, etc.
> So in v2 the value is actually referenced by the IndexInput as its current
> buffer. It can do this safely by implementing reference counting and a
> deallocation method (only used in the Unsafe impl).
>
> The next question is: what is the state of the committed code? All the
> tests pass, but a large part of the code is disabled because of a bug I
> haven't tracked down yet.
>
> https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-store/src/test/java/org/apache/blur/store/CacheDirectoryTestSuite.java;h=2085bd0b044470b4e1122c5949e634429c58012c;hb=apache-blur-0.2#l55
>
> Line 55 needs to return true and all the tests need to pass.
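[Editor's note: the zero-copy-on-hit scheme works because a value stays alive while any reader holds a reference to it. A hypothetical sketch of such a reference-counted value, using the on-heap byte[] flavor (the interface and names below are illustrative, not Blur's actual API; the Unsafe variant would free its native memory where this one merely marks itself freed):]

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a reference-counted cache value. The cache holds one reference;
// each IndexInput using the block as its current buffer holds another.
interface CacheValue {
    byte read(long position);
    int size();
    void incRef();
    void decRef();   // deallocates when the count reaches zero
}

class OnHeapCacheValue implements CacheValue {
    private final byte[] data;
    private final AtomicInteger refs = new AtomicInteger(1); // cache's reference
    private volatile boolean freed = false;

    OnHeapCacheValue(byte[] data) { this.data = data; }

    @Override public byte read(long position) {
        // No copy on a cache hit: the reader reads straight from this buffer.
        return data[(int) position];
    }

    @Override public int size() { return data.length; }

    @Override public void incRef() { refs.incrementAndGet(); }

    @Override public void decRef() {
        if (refs.decrementAndGet() == 0) {
            freed = true;   // an Unsafe impl would release native memory here
        }
    }

    boolean isFreed() { return freed; }
}
```

[On eviction the cache drops its reference, but the memory outlives the eviction until the last reader releases it, which is what makes handing the raw buffer to the IndexInput safe.]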
> Once they do, I will perform a burn-in test on a cluster to validate that
> there are no issues. After that, I would like to move to make it the
> default implementation.
>
> I hope this helps explain the reasons behind the new block cache and how
> it has been designed. Sorry for not posting about it earlier; I got
> distracted with the release of 0.2.0.
>
> Aaron
>
> On Tue, Oct 1, 2013 at 12:24 PM, rahul challapalli <
> [email protected]> wrote:
>
> > Hi Aaron,
> >
> > What is the status of the new implementation? I couldn't find any
> > discussion on this new implementation and its design. If I missed a
> > particular email thread which discussed the goals of v2, can you please
> > point me to that thread? If not, can you elaborate on your thoughts,
> > goals, and what needs to be done? Thank you.
> >
> > - Rahul
