Alex McManus writes:
>
> > Maybe your fields are too long, so that only part of them gets indexed
> > (look at IndexWriter.maxFieldLength).
>
> This is interesting, I've had a look at the JavaDoc and I think I
> understand. The maximum field length describes the maximum number of unique
> terms, not the maximum number of words/tokens. Therefore, even if I have a
> 4 GB field, I could safely use a maxFieldLength of, say, 100k words, which
> should cover the maximum number of unique words, rather than the 800 million
> that would be needed to handle every token.
>
> Is this correct?
A short look at the source says no.
maxFieldLength is handed down to DocumentWriter, where one finds:

    TokenStream stream = analyzer.tokenStream(fieldName, reader);
    try {
      for (Token t = stream.next(); t != null; t = stream.next()) {
        position += (t.getPositionIncrement() - 1);
        addPosition(fieldName, t.termText(), position++);
        if (++length > maxFieldLength) break;  // length counts every token
      }
    } finally {
      stream.close();
    }

So it's the total number of tokens, not the number of distinct terms.
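For reference, here is a minimal sketch of raising the limit at indexing
time. This is just an illustration: whether the limit is the public field
IndexWriter.maxFieldLength or a setter depends on your Lucene version, and
the path, analyzer, and 100000 value here are assumptions, not
recommendations.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class RaiseLimit {
      public static void main(String[] args) throws Exception {
        // Older Lucene exposes the limit as a public field on IndexWriter
        // (default 10000 tokens per field); later versions added a
        // setMaxFieldLength() method instead.
        IndexWriter writer =
            new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        writer.maxFieldLength = 100000; // counts every token, duplicates included
        // ... writer.addDocument(...) ...
        writer.close();
      }
    }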
>
> Is 100k a worrying maxFieldLength, in terms of how much memory this would
> consume?
>
Depends on the size of your documents ;-)
I use 250000 without problems, but my documents are not that big (<40000
tokens). I just want to make sure not to lose any text during indexing.
> Does Lucene issue a warning if this limit is exceeded during indexing (it
> would be quite worrying if it was silently discarding terms)?
>
No, it doesn't.
I guess the idea behind this limit is that the relevant terms should occur
in the first n words, and indexing the rest would just increase the index
size.
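If the silent truncation worries you, one rough workaround is to run the
analyzer over the text yourself and count tokens before indexing. This is
only a sketch against the old TokenStream API; countTokens and the "body"
field name are made up for illustration:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class TruncationCheck {
      // Counts the tokens the analyzer would emit for one field, so a
      // document that would exceed maxFieldLength can be flagged up front.
      static int countTokens(Analyzer analyzer, String field, String text)
          throws IOException {
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        int count = 0;
        try {
          for (Token t = stream.next(); t != null; t = stream.next()) {
            count++;
          }
        } finally {
          stream.close();
        }
        return count;
      }
    }

Then something like

    if (countTokens(analyzer, "body", text) > writer.maxFieldLength) { /* warn */ }

before addDocument() would at least tell you which documents get cut off.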
Morus