> When I go and retrieve the term frequency vectors, for
> any document under about 90k, everything looks as
> expected.  However for larger documents (I haven't
> found the exact point, but I know that those over 128k
> qualify) the sum of the term frequencies in the vector
> seems to max out at 10001.  Here's the code snippet
> that I'm using when I see this:

That's probably because there is a limit built into Lucene where it ignores any 
tokens in a field past the first 10,000.  There is a property you can set to 
increase this limit.  I don't have the source in front of me right now, but if 
you go into the index subdirectory of the Lucene source and grep for 10000, you 
should find it.  Let's say for the sake of argument that the name of the 
property is "maxTokens".  Then you could just do this:

java -Dorg.apache.lucene.maxTokens=100000 yourapp ...

to get a higher limit.  Of course, you could also change the Lucene source file 
and recompile it.  Note that, in general, you CANNOT just set the property in 
your code: the Lucene class stores the value in a static final int, so the 
property is read only once, at class load time.
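
To illustrate the class-load point, here is a minimal sketch in plain Java (no 
Lucene needed).  The TokenLimit class is a hypothetical stand-in for whatever 
Lucene class holds the limit, and "org.apache.lucene.maxTokens" is still the 
made-up property name from above, not necessarily the real one:

```java
public class StaticLimitDemo {
    // Hypothetical stand-in for the Lucene class that holds the limit.
    static class TokenLimit {
        // Read once, when TokenLimit is initialized; later changes to the
        // system property are never seen by this field.
        static final int MAX_TOKENS =
            Integer.getInteger("org.apache.lucene.maxTokens", 10000);
    }

    public static void main(String[] args) {
        // First access initializes TokenLimit and reads the property.
        int before = TokenLimit.MAX_TOKENS;

        // Too late: the static final field has already been assigned.
        System.setProperty("org.apache.lucene.maxTokens", "100000");
        int after = TokenLimit.MAX_TOKENS;

        System.out.println(before + " " + after);
    }
}
```

Run without any -D flag, this should print "10000 10000".  Run with 
-Dorg.apache.lucene.maxTokens=100000 on the command line, both values come out 
as 100000, because the property is already set when the class is initialized.  
That is also why I said "in general": a System.setProperty call that runs 
before the class is first touched would take effect, but that ordering is easy 
to get wrong.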

Good luck!

--MDC
