TermFrequencies vector limits?

marigoldcc Sun, 20 Nov 2005 23:53:29 -0800

Hi.  I was wondering if anyone else has seen this
before.  I'm using  lucene 1.4.3 and have indexed
about 3000 text documents using the statement:


doc.add(Field.Text("contents", new FileReader(f),
true));

When I go and retrieve the term frequency vectors, for
any document under about 90k, everything looks as
expected.  However for larger documents (I haven't
found the exact point, but I know that those over 128k
qualify) the sum of the term frequencies in the vector
seems to max out at 10001.  Here's the code snippet
that I'm using when I see this:

        int vecSize = vector.size();
        for (int j = 0; j < vecSize; j++) {
            currentTermFreq =
vector.getTermFrequencies()[j];
            sumTermFreq = currentTermFreq +
sumTermFreq;
            if ( currentTermFreq > maxTermFreq) {
                maxTermFreq = currentTermFreq;
            }
        }

The results in sumTermFreq winds up being 10001 for
large documents.  The vector.size() varies from
document to document, the term with the highest
freqency (and that frequency) varies from document to
document, but not the sum.

Any thougths/suggestions would be appreciated.

Thanks
--MG




                
__________________________________ 
Yahoo! FareChase: Search multiple travel sites in one click.
http://farechase.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

TermFrequencies vector limits?

Reply via email to