On Dec 20, 2007 11:57 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Yonik Seeley wrote:
> > On Dec 20, 2007 11:33 AM, Gabi Steinberg
> > <[EMAIL PROTECTED]> wrote:
> >> It might be a bit harsh to drop the document if it has a very long
> >> token
> >> in it.
> >
> > There are really two issues here.
> > For long tokens, one could either silently ignore them or throw an
> > exception.
>
> I can see the argument both ways.

Me too.

>  On the one hand, we want indexing
> to be robust/resilient, such that massive terms are quietly skipped
> (maybe with a log to infoStream if it's set).
>
> On the other hand, clearly there is something seriously wrong when
> your analyzer is producing a single 16+ KB term, so it would be
> nice to be brittle/in-your-face, forcing the user to notice and
> correct the situation.
>
> Also, it's really bad once these terms pollute your index.  E.g.,
> suddenly the TermInfos index can easily take tremendous amounts of
> RAM, slow down indexing/merging/searching, etc.  This is why
> LUCENE-1052 was created.  It's a lot better to catch this up front
> than to let it pollute your index.
>
> If we want to take the "in your face" solution, I think the cutoff
> should be less than 16 KB (16 KB is just the hard limit inside
> DocumentsWriter).
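
FWIW, the "quietly skip" behavior is already easy to get in the
analysis chain.  A rough sketch against the 2.x TokenStream API (the
filter name is made up for illustration, and the cutoff would be
whatever the app decides):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  /** Silently drops tokens longer than maxLen. */
  public class MaxTokenLengthFilter extends TokenFilter {
    private final int maxLen;

    public MaxTokenLengthFilter(TokenStream in, int maxLen) {
      super(in);
      this.maxLen = maxLen;
    }

    public Token next() throws IOException {
      Token t = input.next();
      while (t != null && t.termText().length() > maxLen)
        t = input.next();   // skip the over-long token
      return t;
    }
  }

(LengthFilter in core already does roughly this, with both a min and
a max, so users who want skipping don't have to wait on a change in
the writer.)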
>
> > For any exception hit while indexing a document (and passed through
> > to the user), it seems like that document should not be in the
> > index.
>
> I like this disposition because it means the index is in a known
> state.  It's bad to have partial docs in the index: it can only lead
> to more confusion as people try to figure out why some terms work for
> retrieving the doc but others don't.

Right... and I think that was the behavior before the indexing code
was rewritten: the new single-doc segment was only added after the
complete document had been inverted, so any exception would prevent
it from being added.
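
That all-or-nothing behavior also keeps the caller side trivial.  A
quick sketch (just illustrative, assuming a plain 2.x IndexWriter):

  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  // If addDocument fails, the doc left no partial terms behind, so
  // it's safe to log the failure and move on to the next doc.
  void addOneDoc(IndexWriter writer, Document doc) {
    try {
      writer.addDocument(doc);
    } catch (IOException e) {
      System.err.println("could not index doc: " + e);
    }
  }

(An analyzer that throws a RuntimeException would propagate the same
way; the point is only that a failed doc never needs cleanup.)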

-Yonik
