By default, documents get truncated at 10,000 terms (maybe there is
an off-by-one where it is going to 10,001 though?).
To increase this, and I always do, set the max field length on your
IndexWriter, and re-index. In 1.4.3, you set the maxFieldLength
variable of IndexWriter directly. We've changed this to be
setMaxFieldLength in the TRUNK codebase.
Erik
On 20 Nov 2005, at 20:16, <[EMAIL PROTECTED]>
<[EMAIL PROTECTED]> wrote:
Hi. I was wondering if anyone else has seen this
before. I'm using lucene 1.4.3 and have indexed
about 3000 text documents using the statement:
doc.add(Field.Text("contents", new FileReader(f),
true));
When I go and retrieve the term frequency vectors, for
any document under about 90k, everything looks as
expected. However for larger documents (I haven't
found the exact point, but I know that those over 128k
qualify) the sum of the term frequencies in the vector
seems to max out at 10001. Here's the code snippet
that I'm using when I see this:
int vecSize = vector.size();
for (int j = 0; j < vecSize; j++) {
currentTermFreq =
vector.getTermFrequencies()[j];
sumTermFreq = currentTermFreq +
sumTermFreq;
if ( currentTermFreq > maxTermFreq) {
maxTermFreq = currentTermFreq;
}
}
The results in sumTermFreq winds up being 10001 for
large documents. The vector.size() varies from
document to document, the term with the highest
freqency (and that frequency) varies from document to
document, but not the sum.
Any thougths/suggestions would be appreciated.
Thanks
--MG
__________________________________
Yahoo! FareChase: Search multiple travel sites in one click.
http://farechase.yahoo.com
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]