On Mon, Nov 11, 2013 at 1:54 AM, Ravikumar Govindarajan <
[email protected]> wrote:

> As you pointed out, there will be some penalty for this cache, especially
> when the number of rowids increases. Interacting with this cache during
> IndexReader open/close is going to have some overhead.
>
> Instead, can we decouple this and make it a "write-through-cache"?
>
> Ex: Map<SegName, Ref-Counted-PrimeDocBitSet>
>
> The codec will publish new data to this cache on flush [new segment
> creation].
>
> Every access can be ref-counted, and during segment removal [merges]
> obsolete entries can be queued and removed from the cache once their
> ref-count drops to zero.
>
> Typically, I feel that this cache should be free of IndexReader open/close
> and should instead live until BlurNRTIndex.close() is called. Then the
> overhead is really minimal.
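>
> A rough sketch of what I have in mind (class and method names are just
> placeholders, not existing Blur APIs):
>
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
> import java.util.concurrent.atomic.AtomicInteger;
>
> import org.apache.lucene.util.OpenBitSet;
>
> // Hypothetical write-through cache: the codec publishes prime docs on
> // flush, readers acquire/release entries, merges queue removals.
> public class PrimeDocWriteThroughCache {
>
>   public static class RefCountedPrimeDocs {
>     final OpenBitSet primeDocs;
>     final AtomicInteger refCount = new AtomicInteger(1); // cache's own ref
>
>     RefCountedPrimeDocs(OpenBitSet primeDocs) {
>       this.primeDocs = primeDocs;
>     }
>   }
>
>   private final ConcurrentMap<String, RefCountedPrimeDocs> cache =
>       new ConcurrentHashMap<String, RefCountedPrimeDocs>();
>
>   // Called by the codec on flush [new segment creation].
>   public void publish(String segName, OpenBitSet primeDocs) {
>     cache.put(segName, new RefCountedPrimeDocs(primeDocs));
>   }
>
>   // Every access is ref-counted; callers must release() later.
>   public RefCountedPrimeDocs acquire(String segName) {
>     RefCountedPrimeDocs entry = cache.get(segName);
>     if (entry != null) {
>       entry.refCount.incrementAndGet();
>     }
>     return entry;
>   }
>
>   public void release(RefCountedPrimeDocs entry) {
>     entry.refCount.decrementAndGet(); // entry can be freed once it hits 0
>   }
>
>   // Called during segment removal [merges]; drops the cache's own ref so
>   // the entry goes away once the last reader releases it.
>   public void remove(String segName) {
>     RefCountedPrimeDocs entry = cache.remove(segName);
>     if (entry != null) {
>       release(entry);
>     }
>   }
> }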
>

Not sure I follow you here.  Are you talking about the file-based bitsets
that used to back the per-segment filters?  If so, I think they already
live with the shard (BlurNRTIndex.close()) as well as with the segment.  So
if the segment is still live, the filters can be accessed.  If the filter
is used, it's pulled into memory.  If the filter is written, the block
cache is already set up to be a write-through cache.

If I got this all wrong, can you describe things again?  :-)

Thanks,
Aaron


>
> What do you think?
>
> --
> Ravi
>
>
>
>
> On Sat, Nov 9, 2013 at 9:52 AM, Aaron McCurry <[email protected]> wrote:
>
> > On Fri, Nov 8, 2013 at 2:22 AM, Ravikumar Govindarajan <
> > [email protected]> wrote:
> >
> > > Wow, this saving of filters in a custom-codec is super-cool.
> > >
> > > Let me describe the problem I was thinking about.
> > >
> > > Assuming we have the RAMDir and disk swap approach, I was just starting
> > > to deliberate on the read path.
> > >
> > > PrimeDocCache looks like a challenge for this approach, as the same row
> > > will now be present across multiple segments. Each segment will have a
> > > "PrimeDoc" field per row, but during merge this info gets duplicated for
> > > each row.
> > >
> > > I was thinking of recording the "start-doc" of each row to a separate
> > > file via a custom codec, like you have done for the FilterCache.
> > >
> > > During warm-up, it can read the entire file containing "start-docs" and
> > > populate the PrimeDocCache.
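> > >
> > > A rough sketch of the warm-up read (the file format and names here are
> > > just illustrative, not an actual codec API):
> > >
> > > import java.io.IOException;
> > >
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.IOContext;
> > > import org.apache.lucene.store.IndexInput;
> > > import org.apache.lucene.util.OpenBitSet;
> > >
> > > public class PrimeDocWarmup {
> > >
> > >   // Assumed format: a vInt count followed by delta-encoded vInt
> > >   // start docs, one per row in the segment.
> > >   public static OpenBitSet readPrimeDocs(Directory dir, String fileName,
> > >       int maxDoc) throws IOException {
> > >     IndexInput in = dir.openInput(fileName, IOContext.READ);
> > >     try {
> > >       OpenBitSet primeDocs = new OpenBitSet(maxDoc);
> > >       int count = in.readVInt();
> > >       int doc = 0;
> > >       for (int i = 0; i < count; i++) {
> > >         doc += in.readVInt(); // delta decode each row's start doc
> > >         primeDocs.set(doc);
> > >       }
> > >       return primeDocs;
> > >     } finally {
> > >       in.close();
> > >     }
> > >   }
> > > }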
> > >
> >
> > I like the idea.  I tend to prototype to figure out how hard and how
> > performant a solution will be.  :-)  Let's see if we can make it work.
> >
> > Aaron
> >
> >
> > >
> > > --
> > > Ravi
> > >
> > >
> > >
> > >
> > > On Fri, Nov 8, 2013 at 5:04 AM, Aaron McCurry <[email protected]>
> > > wrote:
> > >
> > > > So the filter cache is really just a placeholder for keeping Lucene
> > > > Filters around between queries.  The DefaultFilterCache class does
> > > > nothing; however, I have implemented one that I make use of regularly.
> > > >
> > > >
> > > > https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-core/src/main/java/org/apache/blur/manager/AliasBlurFilterCache.java;h=92491d0ceb3e7ce09902110e3bac5fa485959dab;hb=apache-blur-0.2
> > > >
> > > > If you write your own and you want to build a logical bitset cache
> > > > for the filter (so it's faster), take a look at the
> > > > "org.apache.blur.filter.FilterCache" class.  It wraps an existing
> > > > filter, loads it into the block cache, and writes it to disk (via the
> > > > Directory).  The filters live with the segment, so if the segment
> > > > gets removed, so will the on-disk "filter" and the in-memory cache
> > > > of it.
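> > > >
> > > > The general pattern looks something like this (just an illustration
> > > > of the per-segment bitset idea, leaving out the on-disk part; it is
> > > > not the actual FilterCache code):
> > > >
> > > > import java.io.IOException;
> > > > import java.util.Collections;
> > > > import java.util.Map;
> > > > import java.util.WeakHashMap;
> > > >
> > > > import org.apache.lucene.index.AtomicReader;
> > > > import org.apache.lucene.index.AtomicReaderContext;
> > > > import org.apache.lucene.search.BitsFilteredDocIdSet;
> > > > import org.apache.lucene.search.DocIdSet;
> > > > import org.apache.lucene.search.DocIdSetIterator;
> > > > import org.apache.lucene.search.Filter;
> > > > import org.apache.lucene.util.Bits;
> > > > import org.apache.lucene.util.FixedBitSet;
> > > >
> > > > // Caches the wrapped filter's result as a bitset per segment core,
> > > > // so cached entries go away when the segment does.
> > > > public class BitSetCachingFilter extends Filter {
> > > >
> > > >   private final Filter wrapped;
> > > >   private final Map<Object, DocIdSet> cache = Collections
> > > >       .synchronizedMap(new WeakHashMap<Object, DocIdSet>());
> > > >
> > > >   public BitSetCachingFilter(Filter wrapped) {
> > > >     this.wrapped = wrapped;
> > > >   }
> > > >
> > > >   @Override
> > > >   public DocIdSet getDocIdSet(AtomicReaderContext context,
> > > >       Bits acceptDocs) throws IOException {
> > > >     AtomicReader reader = context.reader();
> > > >     Object key = reader.getCoreCacheKey();
> > > >     DocIdSet cached = cache.get(key);
> > > >     if (cached == null) {
> > > >       FixedBitSet bits = new FixedBitSet(reader.maxDoc());
> > > >       DocIdSet set = wrapped.getDocIdSet(context, null);
> > > >       DocIdSetIterator it = set == null ? null : set.iterator();
> > > >       if (it != null) {
> > > >         bits.or(it);
> > > >       }
> > > >       cache.put(key, bits);
> > > >       cached = bits;
> > > >     }
> > > >     // Deletes (acceptDocs) are applied per query rather than baked
> > > >     // into the cached bits.
> > > >     return BitsFilteredDocIdSet.wrap(cached, acceptDocs);
> > > >   }
> > > > }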
> > > >
> > > > On Thu, Nov 7, 2013 at 8:08 AM, Ravikumar Govindarajan <
> > > > [email protected]> wrote:
> > > >
> > > > > Great. In that case, it will benefit me in building a "rowid"
> > > > > filter cache.
> > > > >
> > > > > I saw that Blur has a DefaultFilterCache class. Is this the class
> > > > > that needs to be customized? Will NRT re-opens [reader close/open,
> > > > > with applyAllDeletes] take care of auto-invalidating such a cache?
> > > > >
> > > >
> > > > Filtering is a query operation, so for each new segment (NRT
> > > > re-opens) the Lucene Filter API handles creating a new filter for
> > > > that segment.  The delete operations are up to how you code the
> > > > Filter.  But that's all Lucene code.
> > > >
> > > > The DefaultFilterCache just allows you to cache the filter objects
> > > > themselves and it provides callbacks when table/shards are opened and
> > > > closed.
> > > >
> > > > Aaron
> > > >
> > > >
> > > > >
> > > > > --
> > > > > Ravi
> > > > >
> > > > >
> > > > > On Thu, Nov 7, 2013 at 5:44 PM, Aaron McCurry <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Yes.  But I believe the "rowId" needs to be "rowid".
> > > > > >
> > > > > > Aaron
> > > > > >
> > > > > >
> > > > > > On Thu, Nov 7, 2013 at 5:16 AM, Ravikumar Govindarajan <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Does Blur permit queries with rowId?
> > > > > > >
> > > > > > > Ex:
> > > > > > > docs.body:hello AND rowId:123
> > > > > > >
> > > > > > > Is it possible to optimize such queries with filter caching,
> > > > > > > etc.?
> > > > > > >
> > > > > > > --
> > > > > > > Ravi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
