Alex McManus writes:
>
> > Maybe your fields are too long, so that only part of them gets indexed
> > (look at IndexWriter.maxFieldLength).
>
> This is interesting, I've had a look at the JavaDoc and I think I
> understand. The maximum field length describes the maximum number of unique
> terms, not the maximum number of words/tokens. Therefore, even if I have a
> 4 GB field, I could safely use a maxFieldLength of, say, 100k words, which
> should cover the maximum number of unique words, rather than the 800 million
> that would be needed to handle every token.
>
> Is this correct?
A short look at the source says no.
maxFieldLength is handed down to DocumentWriter, where one finds:

    TokenStream stream = analyzer.tokenStream(fieldName, reader);
    try {
      for (Token t = stream.next(); t != null; t = stream.next()) {
        position += (t.getPositionIncrement() - 1);
        addPosition(fieldName, t.termText(), position++);
        if (++length > maxFieldLength) break;  // length counts every token
      }
    } finally {
      stream.close();
    }

So it's the total number of tokens, not the number of distinct terms.
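For reference, here is a minimal sketch of raising the limit at indexing
time. This is just an illustration: whether the limit is the public field
IndexWriter.maxFieldLength or a setter depends on your Lucene version, and
the path, analyzer, and 100000 value here are assumptions, not
recommendations.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class RaiseLimit {
      public static void main(String[] args) throws Exception {
        // Older Lucene exposes the limit as a public field on IndexWriter
        // (default 10000 tokens per field); later versions added a
        // setMaxFieldLength() method instead.
        IndexWriter writer =
            new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        writer.maxFieldLength = 100000; // counts every token, duplicates included
        // ... writer.addDocument(...) ...
        writer.close();
      }
    }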
>
> Is 100k a worrying maxFieldLength, in terms of how much memory this would
> consume?
>
Depends on the size of your documents ;-)
I use 250000 without problems, but my documents are not that big (<40000
tokens). I just want to make sure not to lose any text during indexing.
> Does Lucene issue a warning if this limit is exceeded during indexing (it
> would be quite worrying if it was silently discarding terms)?
>
No, it doesn't.
I guess the idea behind this limit is that the relevant terms should occur
in the first n words, and indexing the rest would just increase the index
size.
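If the silent truncation worries you, one rough workaround is to run the
analyzer over the text yourself and count tokens before indexing. This is
only a sketch against the old TokenStream API; countTokens and the "body"
field name are made up for illustration:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class TruncationCheck {
      // Counts the tokens the analyzer would emit for one field, so a
      // document that would exceed maxFieldLength can be flagged up front.
      static int countTokens(Analyzer analyzer, String field, String text)
          throws IOException {
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        int count = 0;
        try {
          for (Token t = stream.next(); t != null; t = stream.next()) {
            count++;
          }
        } finally {
          stream.close();
        }
        return count;
      }
    }

Then something like

    if (countTokens(analyzer, "body", text) > writer.maxFieldLength) { /* warn */ }

before addDocument() would at least tell you which documents get cut off.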
Morus