[jira] Commented: (LUCENE-1313) Realtime Search

Michael McCandless (JIRA) Sat, 02 May 2009 03:51:54 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705255#action_12705255
 ]


Michael McCandless commented on LUCENE-1313:
--------------------------------------------


{quote}
IndexFileDeleter takes into account the ram directory (which
when using NRT with the FSD caused files to not be found).
{quote}

I don't like how "deep" the dichotomy of "RAMDir vs FSDir" is being
pushed.  Why can't we push FSD down to all these places (IFD,
SegmentInfo/s, etc.)?

{quote}
FSD is included and writes fdx, fdt, tvx, tvf, tvd extension
files to the primary directory (which is the same as
IW.directory). LUCENE-1618 needs to be updated with these
changes (or we simply include it in this patch as the
LUCENE-1618 patch is only a couple of files).
{quote}

Why did this require changes to FSD?

{quote}
I think we need to give the option of a ram mergescheduler
because the user may want not want the ram merging and disk
merging to compete for threads. I'm thinking if of the use case
where NRT is a priority then one may allocate more threads to
the ram CMS and less to the disk CMS. This also gives us the
option of trying out more parameters when performing benchmarks
of NRT.
{quote}

I think we're unlikely to gain from more than 1 BG thread for RAM
merging?  But I agree it'd be horrible if CMS blocked RAM merging
because its allotted threads were tied up merging disk segments.
Could we simply make the single CMS instance smart enoguh to realize
that a single RAM merge is allowed to proceed regardless of the thread
limit?

{quote}
We may want to default the ram mergepolicy to not use compound
files as it's not useful when using a ram dir?
{quote}

I think actually hardwire it, not just default.  Building CFS in RAM
makes no sense.  Worse, if we allow one to choose to do it we then
have to fix FSD to understand CFX must go to the dir too, and, we'd
have to fix IW to not merge in the doc store files when building a
private CFS.  Net/net I think we should not allow CFS for the RAM
segments.

On merging to disk it can then respect the user's CFS setting.

{quote}
Because FSD uses IW.directory, FSD will list files that
originated from FSD and from IW.directory, we may want to keep
track of which files are supposed to be in FSD (from the
underlying primary dir) and which are not?
{quote}

I don't understand what's wrong here?

{quote}
> If NRT is never used, the behavior of IW should be
> unchanged (which is not the case w/ this patch I think). RAMDir
> should be created the first time a flush is done due to NRT
> creation.

In the patch if ramdir is not passed in, the behavior of IW
remains the same as it is today. You're saying we should have IW
create the ramdir by default after getReader is called and
remove the IW ramdir constructor?
{quote}

Right.  This should be "under the hood".

{quote}
What if the user has an alternative ramdir implementation they want to
use?
{quote}

I think I'd rather not open up that option just yet.  This really is a
private optimization to how IW uses RAM.  We may want to further
change/improve how RAM is used.

Way back when, IW used a RAMDir internally for buffering; then, with
LUCENE-843 we switched to whole different format (DW's ram
buffering).  Now we are adding back RAMDir for NRT; maybe we'll switch
its format at some point... or change NRT to directly search DW's
RAM... etc.  How IW uses RAM is very much an internal detail so I'd
rather not expose it publically.

[BTW: once we have this machinery online, it's conceivable that we'd
want to flush to RAMDir even in the non-NRT case.  EG, say DW's RAM
buffer is full and it's time to flush.  If it flushes to RAM,
typically the RAMDir is far more compact than DW's RAM buffer and it
then still has some more space to work with, before having to flush to
disk.  If we explore this it should be in a new issue (later)...]

{quote}
> StoredFieldsWriter & TermVectorsTermsWriter now writes to
> IndexWriter.getFlushDirectory(), which is confusing because that
> method returns the RAMDir if set? Shouldn't this be the
> opposite? (Ie it should flush to IndexWriter.getDirectory()? Or
> we should change getFlushDiretory to NOT return the
> ramdir?)

The attached patch uses FileSwitchDirectory, where these files
are written to the primary directory (IW.directory). So
getFlushDirectory is ok?
{quote}

OK, though I'd like to simply always use FSD, even if primary &
secondary are the same dir.  All these if's checking for both dirs,
passing both dirs deep into Lucene's APIs, etc., are spooky.

{quote}
> Why did you need to add synchronized to some of the
> SegmentInfo files methods? (What breaks if you undo that?). The
> contract here is IW protects access to SegmentInfo/s

SegmentInfo.files was being cleared while sizeInBytes was called
which resulted in an NPE. The alternative is sync IW in
IW.size(SegmentInfos) which seems a bit extreme just to obtain
the size of a segment info?
{quote}

But... why did we have one thread asking for size while another was
tweaking the SegmentInfo?  What leads to that?  We need to better
understand the root cause here.

The size consumed by the RAM segments should be carefully computed
(called only in sychchronized(iw) context) and then shared. This value
changes relatively rarely (on flushing a new segment to ram; on
applying deletes that include RAM segments; on doing a ram->ram
merge), but is read frequently (per doc added, to decide whether it's
time to flush).  I think the value should be pushed to DW whenever it
changes, via synchronized method in DW; and then the existing
synchronized logic in DW that decides if it's time to "flush after"
should consult that value.  No further synchronizing should be
necessary.

Also, this ram size should be used not only for deciding when it's
time to merge to a disk segment, but also when it's time for DW to
flush a new segment (which I think your current patch is missing?).

{quote}
> The MergePolicy needs some smarts when it's dealing w/ RAM. EG it
> should not do a merge of more than XXX% of total RAM usage (should
> flush to the real directory instead).

Isn't this handled well enough in updatePendingMerges or is
there more that needs to be done?
{quote}

There is more that needs to be done, because MergePolicy must
conditionalize its logic based on RAM vs FS.  Ie, if our RAM buffer is
32 MB, and there are say 31 MB of RAM segments that suddenly need
merging (becuase we just flushed the 10th RAM segment), we should not
do a RAM -> RAM merge at that point (because 31./32. = very high pctg
of our net RAM buffer).  Instead we should force RAM -> disk at that
point, even though technically RAM is not yet full.

Ooh: maybe a better approach is to disallow the merge if the expected
peak RAM usage will exceed our buffer.  I like this better.  So if
budget is 32 MB, and net RAM used (segments + DW) is say 22, we have a
10 MB "budget", so we are allowed to select merges that total to < 10
MB.


> Realtime Search
> ---------------
>
>                 Key: LUCENE-1313
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1313
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.4.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1313) Realtime Search

Reply via email to