Re: Lucene in-memory index

Igor Shalyminov Fri, 18 Oct 2013 10:21:08 -0700

Hello!

OK, it turns out that DirectPostingsFormat really is extreme: 8GB of 
index didn't fit into a Java heap of 20+ GB.
Is there a postings format that reads from disk the standard way 
but uses no compression?
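
For a rough sense of why uncompressed in-RAM postings grow so much, here is a
minimal sketch comparing a plain int[] of doc IDs with a delta + variable-byte
encoding. It is an illustration only: the class and method names are made up,
and the encoding is a simplified stand-in, not the actual format of any Lucene
codec. It also ignores positions, payloads and Java object overhead, so it
does not by itself account for the numbers above.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: size of raw int[] postings vs. delta + variable-byte encoding.
class PostingsSizeSketch {

    // Write a non-negative int as 7-bit groups; the high bit marks "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode the gaps between sorted doc IDs and return the encoded size in bytes.
    static int deltaVIntSize(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, docId - previous);
            previous = docId;
        }
        return out.size();
    }

    // A plain int[] always costs 4 bytes per entry (array header not counted).
    static long rawArraySize(int[] docIds) {
        return 4L * docIds.length;
    }

    public static void main(String[] args) {
        int[] docIds = new int[1_000_000];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i * 3; // assumed toy data: a term occurring in every 3rd document
        }
        System.out.println("raw int[] bytes:  " + rawArraySize(docIds));  // 4000000
        System.out.println("delta+vint bytes: " + deltaVIntSize(docIds)); // 1000000
    }
}
```

With small gaps, each posting fits in one byte instead of four, which is the
kind of saving that is given up when postings are held as simple arrays.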



-- 
Best Regards,
Igor

18.10.2013, 02:06, "Igor Shalyminov" <[email protected]>:
> Mike,
>
> For now I'm just running a SpanQuery over a ~600MB index segment in a 
> single thread (one segment per thread; the complete setup is 30 
> segments totalling 20GB).
>
> I'm trying to use Lucene for the morphologically annotated text corpus 
> (namely, Russian National Corpus).
> The main query type in it is co-occurrence search with desired word 
> morphological features and distance between tokens.
>
> In my test case I work with a single field - grammar (it is word-level - 
> every word in the corpus has one). Full grammar annotation of a word is a set 
> of atomic grammar features.
> For example, the verb "book" has in its grammar:
> - POS tag (V);
> - tense (pres);
>
> and the noun "book":
> - POS tag (N)
> - number (sg).
>
> In general one grammar annotation has approximately 8 atomic features.
>
> Words are treated as initially ambiguous, so an occurrence of the word 
> "book" in the text yields the grammar tokens:
> V    pres    N    sg
> The 2 parses, "V,pres" and "N,sg", are just independent tokens with 
> positionIncrement=0 in the index.
>
> Moreover, each such token has parse bitmask in its payload:
> V|0001    pres|0001    N|0010    sg|0010
>
> Here, V and pres appeared in the 1st parse, and N and sg in the 2nd; the 
> mask allows a maximum of 4 parse variants. This lets me find the word 
> "book" for the query "V" & "pres" but not for the query "V" & "sg".
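
The payload check described above comes down to a bitwise AND of the parse
masks. A minimal sketch, assuming the masks have already been decoded from the
payloads into ints (the class and method names are hypothetical, and no Lucene
API is involved):

```java
// Hypothetical sketch of the parse-bitmask check; masks use one bit per parse variant.
class ParseMaskSketch {

    // True when at least one parse variant contains all of the given features.
    static boolean sameParse(int... masks) {
        int common = ~0; // start with all bits set, then intersect
        for (int mask : masks) {
            common &= mask;
        }
        return common != 0;
    }

    public static void main(String[] args) {
        // Masks for the word "book": V|0001  pres|0001  N|0010  sg|0010
        int v = 0b0001, pres = 0b0001, n = 0b0010, sg = 0b0010;
        System.out.println("V & pres: " + sameParse(v, pres)); // V & pres: true
        System.out.println("V & sg: " + sameParse(v, sg));     // V & sg: false
        System.out.println("N & sg: " + sameParse(n, sg));     // N & sg: true
    }
}
```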
>
> So, I'm running a SpanNearQuery ("A,sg" immediately before "N,sg") 
> with position and payload checking over a 600MB segment, and getting the 
> exact number of matching docs and the total number of matches by iterating 
> over getSpans().
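
To make the two counts concrete, here is a simplified sketch of the adjacency
part only. It assumes the positions of each term are already available as
sorted int arrays per document; the names are hypothetical, the payload check
is left out, and this is not how Lucene implements its spans internally.

```java
// Hypothetical sketch: count "first immediately before second" matches per document.
class AdjacencyCountSketch {

    // Count positions p in 'first' such that p + 1 occurs in 'second'.
    // Both arrays must be sorted ascending without duplicates.
    static int countAdjacent(int[] first, int[] second) {
        int i = 0, j = 0, matches = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + 1;
            if (second[j] < wanted) {
                j++;
            } else if (second[j] > wanted) {
                i++;
            } else {
                matches++;
                i++;
                j++;
            }
        }
        return matches;
    }

    // Returns {number of documents with at least one match, total number of matches}.
    static int[] countHits(int[][] firstByDoc, int[][] secondByDoc) {
        int docHits = 0, totalMatches = 0;
        for (int doc = 0; doc < firstByDoc.length; doc++) {
            int matches = countAdjacent(firstByDoc[doc], secondByDoc[doc]);
            if (matches > 0) {
                docHits++;
            }
            totalMatches += matches;
        }
        return new int[] {docHits, totalMatches};
    }

    public static void main(String[] args) {
        int[][] adjective = {{1, 5, 9}, {3}}; // assumed toy positions of "A,sg" in 2 docs
        int[][] noun = {{2, 7, 10}, {8}};     // assumed toy positions of "N,sg" in 2 docs
        int[] result = countHits(adjective, noun);
        System.out.println("docs=" + result[0] + " matches=" + result[1]); // docs=1 matches=2
    }
}
```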
>
> This takes me about 20 seconds, even when everything is in RAM.
> The next thing I'm going to explore is compression: I'll try 
> DirectPostingsFormat as you suggested.
>
> --
> Best Regards,
> Igor
>
> 17.10.2013, 20:26, "Michael McCandless" <[email protected]>:
>
>>  DirectPostingsFormat holds all postings in RAM, uncompressed, as
>>  simple java arrays.  But it's quite RAM heavy...
>>
>>  The hotspots may also be in the queries you are running ... maybe you
>>  can describe more how you're using Lucene?
>>
>>  Mike McCandless
>>
>>  http://blog.mikemccandless.com
>>
>>  On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
>>  <[email protected]> wrote:
>>>   Hello!
>>>
>>>   I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs.
>>>   Both perform the same for me (equally badly :( ).
>>>   Thus, I think my problem is not disk access (although I always see
>>>   getPayload() at the top of the VisualVM profile).
>>>   So, maybe the hard part of postings traversal is decompression?
>>>   Are there Lucene codecs that use light postings compression (or none
>>>   at all)?
>>>
>>>   And, getting back to the in-memory index topic, is lucene.codecs.memory
>>>   somewhat similar to RAMDirectory?
>>>
>>>   --
>>>   Best Regards,
>>>   Igor
>>>
>>>   10.10.2013, 03:01, "Vitaly Funstein" <[email protected]>:
>>>>   I don't think you want to load indexes of this size into a RAMDirectory.
>>>>   The reasons have been listed multiple times here... in short, just use
>>>>   MMapDirectory.
>>>>
>>>>   On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>>>   <[email protected]> wrote:
>>>>>    Hello!
>>>>>
>>>>>    I need to perform an experiment of loading the entire index in RAM and
>>>>>    seeing how the search performance changes.
>>>>>    My index has TermVectors with payload and position info, StoredFields,
>>>>>    and DocValues. It takes ~30GB on disk (the server has 48GB of RAM).
>>>>>
>>>>>    _indexDirectoryReader = DirectoryReader.open(new RAMDirectory(
>>>>>        FSDirectory.open(new File(_indexDirectory)), IOContext.DEFAULT));
>>>>>
>>>>>    Is the line above the only thing I have to do to complete my goal?
>>>>>
>>>>>    And also:
>>>>>    - will all the data be loaded in the RAM right after opening, or during
>>>>>    the reading stage?
>>>>>    - will the index data be stored in RAM as it is on disk, or will it be
>>>>>    uncompressed first?
>>>>>
>>>>>    --
>>>>>    Best Regards,
>>>>>    Igor
>>>>>
>>>>>    ---------------------------------------------------------------------
>>>>>    To unsubscribe, e-mail: [email protected]
>>>>>    For additional commands, e-mail: [email protected]