Re: TermFrequencies vector limits?

Erik Hatcher Mon, 21 Nov 2005 00:06:49 -0800

By default, each field of a document gets truncated at 10,000 terms when it is indexed (maybe there is an off-by-one where it is going to 10,001 though?).

To increase this, and I always do, set the max field length on your IndexWriter, and re-index. In 1.4.3, you set the maxFieldLength variable of IndexWriter directly. We've changed this to be setMaxFieldLength in the TRUNK codebase.
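
To illustrate why the sum of term frequencies plateaus, here is a minimal sketch in plain Java. It does not use Lucene at all: the class TruncationSketch and its helpers termFrequencies and sum are hypothetical names made up for this example, and the cap is modelled as a simple "stop after maxTerms tokens" rule, so the possible off-by-one mentioned above is not reproduced. The only Lucene names assumed are the ones given in this message (the maxFieldLength variable in 1.4.3, setMaxFieldLength in TRUNK), and they appear only in comments.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: simulates a per-field term cap without using Lucene.
public class TruncationSketch {

    // Hypothetical helper: counts term frequencies, ignoring every token
    // past the first maxTerms tokens (a stand-in for the max field length).
    static Map<String, Integer> termFrequencies(String[] tokens, int maxTerms) {
        Map<String, Integer> freqs = new HashMap<>();
        int limit = Math.min(tokens.length, maxTerms);
        for (int i = 0; i < limit; i++) {
            freqs.merge(tokens[i], 1, Integer::sum);
        }
        return freqs;
    }

    // Sums all frequencies, like the sumTermFreq loop in the question.
    static int sum(Map<String, Integer> freqs) {
        int total = 0;
        for (int f : freqs.values()) {
            total += f;
        }
        return total;
    }

    public static void main(String[] args) {
        // A synthetic "large document": 25,000 tokens over 137 distinct terms.
        String[] tokens = new String[25000];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = "term" + (i % 137);
        }

        // With a cap of 10,000 the sum cannot exceed the cap.
        System.out.println(sum(termFrequencies(tokens, 10000)));             // prints 10000

        // With the cap raised (in Lucene terms: a larger maxFieldLength in
        // 1.4.3, or setMaxFieldLength in TRUNK, followed by a re-index),
        // every token is counted.
        System.out.println(sum(termFrequencies(tokens, Integer.MAX_VALUE))); // prints 25000
    }
}
```

The number of distinct terms and the largest single frequency still depend on the document, but the sum is pinned to the cap for any document longer than the cap, which matches the symptom described below.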


        Erik

On 20 Nov 2005, at 20:16, <[EMAIL PROTECTED]><[EMAIL PROTECTED]> wrote:

Hi.  I was wondering if anyone else has seen this
before.  I'm using Lucene 1.4.3 and have indexed
about 3000 text documents using the statement:

doc.add(Field.Text("contents", new FileReader(f), true));

When I go and retrieve the term frequency vectors, for
any document under about 90k, everything looks as
expected.  However for larger documents (I haven't
found the exact point, but I know that those over 128k
qualify) the sum of the term frequencies in the vector
seems to max out at 10001.  Here's the code snippet
that I'm using when I see this:

        int vecSize = vector.size();
        for (int j = 0; j < vecSize; j++) {
            currentTermFreq = vector.getTermFrequencies()[j];
            sumTermFreq = currentTermFreq + sumTermFreq;
            if (currentTermFreq > maxTermFreq) {
                maxTermFreq = currentTermFreq;
            }
        }

The result in sumTermFreq winds up being 10001 for
large documents.  The vector.size() varies from
document to document, the term with the highest
frequency (and that frequency) varies from document to
document, but not the sum.

Any thoughts/suggestions would be appreciated.

Thanks
--MG




                

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


