: Here's the root question: "Am I reasonably safe, for a single document, in
: thinking of indexing multiple chunks with the same field as being identical,
: for all practical purposes, with indexing the field once with all the chunks
: concatenated together?".

Essentially that's true -- the difference is in the way the
getPositionIncrementGap method of your analyzer is used.  For a single large
field value, it is never used.  For two or more values, it's called in
between each value.

: What surprised me a bit is that SpanQueries work just fine this way. If I
: create a span query for "two" and "five", this doc is found for some slop
: factors and not found for other slop factors, just as though I indexed the
: "tokens" field once with "one two three four five six".

Right ... that's where the position increment gap can come in handy: you can
introduce a "large" gap size so that you can make some phrase/span queries
which match across multiple values, and others which don't.

: So, are there any "gotchas" that spring to mind with the notion of chunking
: the input to < 10,000 words and indexing the chunks multiple times in the
: same field? Let me be clear I'm just beginning to design this, so all I'm

Well, there's really nothing wrong with using multiple values, but as far as
doing that to deal with the 10,000 terms limit...

a) If you don't want the limit, change the limit -- there's no reason to
   work around it.

b) I'm not entirely sure if that limit is on a single field value, or on the
   total number of indexed tokens for that field name -- in which case this
   approach doesn't work around it at all.  Make sure you test with two
   values of the same field whose total number of tokens is bigger than
   10,000.

-Hoss
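
PS: to make the gap behaviour concrete, here is a rough sketch in plain Java.
It is only a model of how token positions get assigned, not Lucene's actual
indexing code. The class name PositionGapSketch, both helper methods, and the
simplified "ordered, within slop" check are made up for illustration. It
assumes whitespace tokenization and that the gap is added once between each
pair of consecutive values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of position assignment for a multi-valued field.
final class PositionGapSketch {

    // Assigns a position to every whitespace-separated token. Positions run
    // on from one value to the next, with `gap` extra positions skipped
    // between consecutive values (assumption: this mirrors what the
    // analyzer's position increment gap does).
    static Map<String, List<Integer>> positions(List<String> values, int gap) {
        Map<String, List<Integer>> result = new HashMap<>();
        int position = -1;
        boolean firstValue = true;
        for (String value : values) {
            if (!firstValue) {
                position += gap; // only applied between values, never within one
            }
            firstValue = false;
            for (String token : value.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                position++;
                result.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            }
        }
        return result;
    }

    // Simplified stand-in for an ordered span/phrase check: true if `second`
    // occurs after `first` with at most `slop` positions in between.
    static boolean withinSlop(Map<String, List<Integer>> pos,
                              String first, String second, int slop) {
        List<Integer> none = Collections.emptyList();
        for (int a : pos.getOrDefault(first, none)) {
            for (int b : pos.getOrDefault(second, none)) {
                if (b > a && b - a - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("one two three", "four five six");

        // Gap of 0: the chunks behave like one concatenated value.
        Map<String, List<Integer>> noGap = positions(chunks, 0);
        System.out.println(withinSlop(noGap, "two", "five", 2)); // true
        System.out.println(withinSlop(noGap, "two", "five", 1)); // false

        // Gap of 100: a small slop no longer reaches across the value boundary.
        Map<String, List<Integer>> bigGap = positions(chunks, 100);
        System.out.println(withinSlop(bigGap, "two", "five", 2));   // false
        System.out.println(withinSlop(bigGap, "two", "five", 102)); // true
    }
}
```

With a gap of 0 the two chunks get the same positions as the single
concatenated string, which matches what you saw with your span queries. With
a large gap, only queries whose slop is at least as big as the gap can still
match across the boundary.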