Chris, I ended up hacking StandardTokenizer::next() to check for $^$^$, and if it is there then set the current Token PositionIncrement to 500 and resume the tokenizing loop (so the word which will be read into that Term will have position increment of 500). As far as I can tell it is working well - how can I check the terms positions in a document's field and see they have been incremented indeed? I have tried Luke, but it doesn't seem to allow this. My field is tokenized and not stored.
Itamar. -----Original Message----- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Sunday, March 30, 2008 8:56 AM To: Lucene Users Subject: Re: setPositionIncrement questions : Breaking proximity data has been discussed several times before, and : concluded that setPositionIncrement is the way to go. In regards of it: : : 1. Where should it be called exactly to create the gap properly? any part of your Analyzer can set the position increment on any token to indicate how far after the previous token it should be. : 2. Is there a way to call it directly somehow while indexing (e.g. after : adding a new paragraph to an existing field) instead of appending $$$ : for example after the new string I'm indexing, and having to update my : tokenizer and filters so they will retain the $$$ chars, indicating the : gap request? if you add multiple Fields with the same name, Lucene will call the "getPositionIncrementGap" method on your Analyzer to determine how much positionIncreiment to put in between the last token of the first Field and the first token of the second Field -- so you could just pass each paragraph as a seperate Field instance .. alternately you can have a single Field instance, and your Analyzer can use whatever mechanims it wants to decide to set the position incriment to something high (a line break, a magic char sequence you put in the string, ... whatever you want) : 3. What is the recommended value to pass setPositionIncrement to create : a reasonable gap, and not risk large documents being indexed improperly : (I mean, is there some sort of high-bound for the position value?). MAX_INT .. pick gaps based on your data and the queries you expect (if you want gaps betwen paragraps, and your paragraphs tend to be under 200 words long, make the gap 500 so "lucene java"~300 can find those words in the same paragram, but can never span multiple paragraphs : 4. What are the consequences of setting PositionIncrement to 0? Does : this mean I can index synonyms or stems aside of the "real" words : without risking data corruption? it means the words appear at the same position - synonyms is a great example of this use case. -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]