> When I go and retrieve the term frequency vectors, for
> any document under about 90k, everything looks as
> expected. However for larger documents (I haven't
> found the exact point, but I know that those over 128k
> qualify) the sum of the term frequencies in the vector
> seems to max out at 10001. Here's the code snippet
> that I'm using when I see this:

That's probably because there is a limit built into Lucene: it ignores any tokens in a field past the first 10,000. There is a property you can set to increase this limit. I don't have the source in front of me right now, but if you go into the index subdirectory of the Lucene source and grep for 10000, you should find it.

Let's say for the purpose of argument that the name of the property is "maxTokens". Then you could get a higher limit by doing this:

    java -Dorg.apache.lucene.maxTokens=100000 yourapp ...

Of course, you could also change the Lucene source file and recompile it.

Note that, in general, you CANNOT just set the property in your code: the Lucene class puts it into a static final int, meaning it examines the value of the property once, at class load time.

Good luck!

--MDC
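
To illustrate the class-load point, here is a minimal sketch in plain Java. It is not Lucene code: the class names MaxTokensDemo and Holder are made up for the example, and "org.apache.lucene.maxTokens" is the same for-the-sake-of-argument property name used above, not necessarily the real one.

```java
class MaxTokensDemo {

    // Hypothetical stand-in for the Lucene class that holds the limit.
    static class Holder {
        // Evaluated once, when Holder is initialized. "maxTokens" is the
        // assumed property name from this thread; 10000 is the default.
        static final int MAX_TOKENS =
            Integer.getInteger("org.apache.lucene.maxTokens", 10000);
    }

    // Returns true if setting the property after the class has been
    // initialized leaves the static final value unchanged.
    static boolean lateSetIsIgnored() {
        int before = Holder.MAX_TOKENS;  // first access initializes Holder
        System.setProperty("org.apache.lucene.maxTokens", "100000");
        int after = Holder.MAX_TOKENS;   // still the value read at init
        return before == after;
    }

    public static void main(String[] args) {
        System.out.println("limit at class load: " + Holder.MAX_TOKENS);
        System.out.println("late set ignored:    " + lateSetIsIgnored());
        System.out.println("limit afterwards:    " + Holder.MAX_TOKENS);
    }
}
```

Run it with and without -Dorg.apache.lucene.maxTokens=100000 on the command line: only the -D setting changes the limit, while the System.setProperty call inside the program comes too late to matter.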