[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796200#action_12796200 ]
Michael McCandless commented on LUCENE-2186:
--------------------------------------------

bq. Is this patch for flex, as it contains CodecUtils and so on?

Actually it's intended for trunk; I was thinking this should land before flex (it's a much smaller change, and it's "isolated" from flex), so I wrote the CodecUtil/BytesRef basic infrastructure thinking flex would then cut over to them.

{quote}
Hmm, so random-access would obviously be the preferred approach for SSDs, but with conventional disks I think the performance would be poor? In 1231 I implemented the var-sized CSF with a skip list, similar to a posting list. I think we should add that here too, and we can still keep the additional index that stores the pointers? We could have two readers: one that allows random access and loads the pointers into RAM (or uses MMAP as you mentioned), and a second one that doesn't load anything into RAM, uses the skip lists and only allows iterator-based access?
{quote}

The intention here is for this ("index values") to replace field cache, but not aim (initially at least) to do much more. Ie, it's "meant" to be RAM resident (either via explicit slurping into RAM or via MMAP), so the SSD or spinning magnets should not be hit on retrieval. If we add an iterator API, I think it should be simpler than the postings API (ie, no seeking; dense iteration where every doc is visited sequentially).

{quote}
It looks like ByteRef is very similar to Payload? Could you use that instead and extend it with the new String constructor and compare methods?
{quote}

Good point! I agree. Also, we should use BytesRef when reading the payload from TermsEnum. Actually I think Payload, BytesRef and TermRef (in flex) should all eventually be merged; of the three names, I like BytesRef the best. With *Enum in flex we can switch to BytesRef. For analysis we should switch PayloadAttribute to BytesRef and deprecate the methods using Payload? Hmmm... but PayloadAttribute is an interface.

{quote}
So it looks like with your approach you want to support certain "primitive" types out of the box, such as byte[], float, int, String?
{quote}

Actually, all "primitive" integer types (byte/short/int/long) are "included" under int, as well as arbitrary bit precisions "between" those primitive types. Because the API uses a method invocation (eg IntSource.get) instead of direct array access, we can "hide" how many bits are actually used under the impl (see the sketch below). The same is true for float/double, except we can't [easily] do arbitrary bit precision there... just 4 or 8 bytes.

{quote}
If someone has custom data types, then they have, similar as with payloads today, the byte[] indirection?
{quote}

Right, byte[] is for String, but also for arbitrary (opaque to Lucene) extensibility. The six concrete impls (separate package-private classes) should give good efficiency to fit the different use cases.
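To make that concrete, here is a minimal sketch of the get()-hides-the-bit-width idea; the class name and packing layout are hypothetical (this is not the patch's PackedInts impl), assuming values are packed back to back into a shared long[]:

{code:java}
// Hypothetical sketch: a fixed-bit-width reader that exposes values only
// through get(), so callers never see how many bits each value occupies.
final class PackedIntSource {
  private final long[] blocks;      // values packed back to back, 64 bits per block
  private final int bitsPerValue;   // eg 5 if every value fits in 5 bits
  private final long mask;

  PackedIntSource(long[] blocks, int bitsPerValue) {
    this.blocks = blocks;
    this.bitsPerValue = bitsPerValue;
    this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
  }

  /** Returns the value for this docID; the bit width is an impl detail. */
  public long get(int docID) {
    final long bitPos = (long) docID * bitsPerValue;
    final int block = (int) (bitPos >>> 6);
    final int offset = (int) (bitPos & 63);
    long value = blocks[block] >>> offset;
    final int spill = offset + bitsPerValue - 64;
    if (spill > 0) {                // value straddles two longs
      value |= blocks[block + 1] << (bitsPerValue - spill);
    }
    return value & mask;
  }
}
{code}

Callers that sort or filter by the value never see whether 4, 5 or 64 bits are used per document; changing the precision only changes the impl behind get().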
{quote}
The code I initially wrote for 1231 exposed IndexOutput, so that one can call write*() directly, without having to convert to byte[] first. I think we will also want to do that for 2125 (store attributes in the index). So I'm wondering if this and 2125 should work similarly?
{quote}

This is compelling (letting Attrs read/write directly), but I have some questions:

* How would the random-access API work? (Attrs are designed for iteration.) Eg, just providing IndexInput/Output to the Attr isn't quite enough -- the encoding is sometimes context dependent (like frq writing the delta between docIDs, or the symbol table needed when reading/writing deref/sorted). How would I build a random-access API on top of that? captureState-per-doc is too costly. And what API would be used to write the shared state, ie to tell the Attr "we are now writing the segment, so you need to dump the symbol table"?
* How would the packed ints work? Eg, say my ints only need 5 bits. (Attrs are sort of designed for one value at a time.)
* How would the "symbol table" based encodings (deref, sorted) work? I guess the attr would need to have some state associated with it, and when I first create the attr I'd need to pass it the segment name, Directory, etc, so it opens the right files?
* I'm thinking we should still directly support native types, ie Attrs are there for extensibility beyond native types?
* Exposing a single attr across a multi-reader sounds tricky -- LUCENE-2154 (and we need this for flex, which is worrying me!). But it sounds like you and Uwe are making some progress on that (using some under-the-hood Java reflection magic)... and this doesn't directly affect this issue, assuming we don't expose this API at the MultiReader level.

{quote}
Thinking out loud: could we then have attributes with serialize/deserialize methods for primitive types, such as float? Could we efficiently use such an approach all the way up to FieldCache? It would be compelling if you could store an attribute as CSF, or in the posting list, retrieve it from the flex APIs, and also from the FieldCache. All would be the same API and there would only be one place that needs to "know" about the encoding (the attribute).
{quote}

This is the grand unification of everything :) I like it, but I don't want that future utopia to stall our progress today... ie I'd rather do something simple yet concrete now, and then work step by step towards that future ("progress not perfection"). That said, if we can get some bite-sized step in today towards that future, that'd be good.

Eg, the current patch only supports "dense" storage, ie it's assumed every document will have a value, because it's aiming to replace field cache. If we wanted to add sparse storage... I think that'd require (or strongly encourage) access via a postings-like iteration API, which I don't see how to take a baby step towards :)

I do think it would be compelling for an Attr to "only" have to expose read/write methods, and then the Attr could be stored in CSF or postings, but I don't see how to build an efficient random-access API on top of that. I think LUCENE-2125 is where we should explore this.

Norms and deleted docs should eventually be able to switch to CSF. In fact, norms should just be a FloatSource, with the default impl being the 1-byte float encoding we use today. This then gives apps full flexibility to plug in their own FloatSource. For deleted docs we should probably create a BoolSource.
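As a rough illustration of that last point (hypothetical classes; the decode table stands in for whatever 1-byte-to-float decoding the default norm encoding uses), norms-as-FloatSource could look something like:

{code:java}
// Hypothetical sketch of "norms are just a FloatSource": the default impl
// keeps today's 1-byte-per-doc encoding and decodes through a 256-entry
// table; apps could plug in their own FloatSource with full precision.
abstract class FloatSource {
  public abstract float get(int docID);
}

final class ByteNormFloatSource extends FloatSource {
  private final byte[] norms;        // one byte per document, as today
  private final float[] decodeTable; // 256 entries, built from the 1-byte encoding

  ByteNormFloatSource(byte[] norms, float[] decodeTable) {
    this.norms = norms;
    this.decodeTable = decodeTable;
  }

  @Override
  public float get(int docID) {
    return decodeTable[norms[docID] & 0xFF];
  }
}
{code}

A BoolSource for deleted docs would follow the same pattern over a packed bit set.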
{quote}
About updating CSF: I hope we can use parallel indexing for that. In other words: it should be possible for users to use parallel indexes to update certain fields, and Lucene should use the same approach internally to store different "generations" of things like norms and CSFs.
{quote}

That sounds great, though I think we need a more efficient way to store the changes. Ie, norms rewrites all norms on any change, which is costly. It'd be better to have some sort of delta format, where you sparsely encode docID + new value, and then on load we merge those on the fly (and segment merging periodically also merges & commits them); a rough sketch of such an overlay is below.

{quote}
Yeah, that's where I got kind of stuck with 1231: we need to figure out how the public API should look, with which a user can add CSF values to the index and retrieve them. The easiest and fastest way would be to add a dedicated new API. The cleaner one would be to make the whole Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt.
{quote}

Right, but LUCENE-1597 is another good but far-away-from-landing goal. I think a dedicated API is fine for the atomic types; field cache today is a dedicated API...

I guess to sum up my thoughts now (but I'm still mulling...):

* I think the random-access, field-cache-like API should be separate from the designed-for-iteration-from-a-file postings API.
* Attrs for extensibility could be compelling, but I don't see how to build an [efficient] random-access API on top of Attrs. It would be very elegant to only have to add a read/write method to your Attr, but that's not really enough for a full codec.
* I don't think we should hold up adding direct support for atomic types until (if ever) we figure out how to add Attrs. Ie I think we should do this in two steps.

The current patch is [roughly] step 1, and I think it should be a compelling replacement for field cache. Memory usage and GC cost of string sorting should be much lower than field cache.

I'm also still mulling on these issues w/ the current patch:

* How could we use index values to efficiently maintain the stats needed for flexible scoring (LUCENE-2187)?
* The current patch doesn't handle merging yet.
* Could norms/deleted docs "conceivably" cut over to the index values API?
* What "dedicated API" to use for indexing & sorting?
* Run basic perf tests to see the cost of using a method call instead of direct array access.
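Tying back to the delta-format idea above, here is a minimal sketch (hypothetical classes, reusing the PackedIntSource sketch from earlier) of overlaying a sparse generation of (docID, new value) pairs over a dense base at read time:

{code:java}
import java.util.Arrays;

// Hypothetical sketch: a dense base of per-doc values overlaid with a sparse
// "delta generation" of updated values, sorted by docID. Readers see the
// merged view; a segment merge would later fold the deltas back into a new
// dense base instead of rewriting every value on each change.
final class OverlayIntSource {
  private final PackedIntSource base; // dense per-doc values
  private final int[] changedDocs;    // sorted docIDs that were updated
  private final long[] newValues;     // parallel array of replacement values

  OverlayIntSource(PackedIntSource base, int[] changedDocs, long[] newValues) {
    this.base = base;
    this.changedDocs = changedDocs;
    this.newValues = newValues;
  }

  public long get(int docID) {
    final int slot = Arrays.binarySearch(changedDocs, docID);
    return slot >= 0 ? newValues[slot] : base.get(docID);
  }
}
{code}

Each lookup pays a binary search over only the changed docs, which stays cheap as long as segment merging periodically commits the deltas into a fresh dense base.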
> First cut at column-stride fields (index values storage)
> ---------------------------------------------------------
>
>                 Key: LUCENE-2186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2186
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
>
>         Attachments: LUCENE-2186.patch
>
>
> I created an initial basic impl for storing "index values" (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc.
> The code is standalone now, and lives under the new package oal.index.values (plus some util changes and refactorings) -- I have yet to integrate it into Lucene, so that eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc.
> It handles 3 types of values:
> * Six variants of byte[] per doc: all combinations of fixed vs variable length, stored either "straight" (good for eg a "title" field), "deref" (good when many docs share the same value, but you won't do any sorting) or "sorted".
> * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between)
> * Floats (4 or 8 byte precision)
> String fields are stored as their UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them).
> This patch also adds a basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl.
> This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation per document.
> Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache.
> For the "sort by String value" case, I think RAM usage & GC load of this index values API should be much better than field cache, since it does not create an object per document (instead it shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes.
> There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie one could make an MMAP impl instead.
> I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (returning flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring.
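As an illustration of the "sorted" byte[] variant and the shared-arrays point in the description above (class and field names are made up, not the patch's): all distinct values live in one shared byte[] in sorted order, each document stores only the ord of its value, and sort comparisons reduce to int comparisons with no per-document String objects:

{code:java}
// Hypothetical sketch of the "sorted" byte[] storage: one shared byte pool
// holds the distinct values concatenated in sorted order, and each document
// stores just the ord of its value. Sorting docs by value compares ords;
// the actual bytes are only touched when the value itself is needed.
final class SortedBytesSource {
  private final byte[] pool;      // all distinct values, concatenated in sorted order
  private final int[] offsets;    // offsets[ord] .. offsets[ord + 1] bounds value #ord
  private final int[] docToOrd;   // per-document ord (could itself be packed ints)

  SortedBytesSource(byte[] pool, int[] offsets, int[] docToOrd) {
    this.pool = pool;
    this.offsets = offsets;
    this.docToOrd = docToOrd;
  }

  /** Ord is enough for sorting: doc A sorts before doc B iff ord(A) < ord(B). */
  public int ord(int docID) {
    return docToOrd[docID];
  }

  /** Materializes the UTF-8 bytes of a doc's value only when actually needed. */
  public byte[] bytes(int docID) {
    final int ord = docToOrd[docID];
    final int start = offsets[ord];
    final int length = offsets[ord + 1] - start;
    final byte[] copy = new byte[length];
    System.arraycopy(pool, start, copy, 0, length);
    return copy;
  }
}
{code}

Since docToOrd can be stored with packed ints, the per-document cost is just the bits needed to address the distinct values, which is where the RAM and GC savings over field cache's per-doc String objects come from.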