Chris, Thanks for your input.
Please let me make sure that I get this right: while iterating through the words in a document, I can use my tokenizer to setPositionIncrement(150) on a specific token, what would make it be more distant from the previous token than it should have been. The next token will already have position increment of 1 and therefore will immediately follow that token, with no extra handling. If I get this right, the best way to achieve that is by appending a predefined string like $$$, such that will not occur accidently in my documents, and have my tokenizer set the position increment as well instead of just tokenizing upon it. >>> Lucene will call the "getPositionIncrementGap" method on your Analyzer to determine how much positionIncreiment to put in between the last token of the first Field and the first token of the second Field -- so you could just pass each paragraph as a seperate Field instance This sounds good, but is risky, since I will have to concatenate my paragraphs that I DO want to have proximity data in between, and if I forget to, or accidently don't do that this will corrupt proximity-based searches. My documents can become very big as well. I guess what I was looking for was a simpler way - say tell Lucene when I do doc.add(new Field) to set the position increment for the last token. The "magic char sequence" will do, but I was wondering if there is a way to do that without ammending my Tokenizer? >>> it means the words appear at the same position ... And what does this mean exactly? How can this affect standard searches? What I might do with this is store stems side-by-side with the original word. From what I've heard so far this is NOT how you do this for English texts - you rather store them in a different field, why is that? I thought if you store them side-by-side you could write a Scorer (or similar) that will return all relevant results for the stem of a given word, boosting words with the same exact syntax more than others. Any ideas on that? Itamar. -----Original Message----- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Sunday, March 30, 2008 8:56 AM To: Lucene Users Subject: Re: setPositionIncrement questions : Breaking proximity data has been discussed several times before, and : concluded that setPositionIncrement is the way to go. In regards of it: : : 1. Where should it be called exactly to create the gap properly? any part of your Analyzer can set the position increment on any token to indicate how far after the previous token it should be. : 2. Is there a way to call it directly somehow while indexing (e.g. after : adding a new paragraph to an existing field) instead of appending $$$ : for example after the new string I'm indexing, and having to update my : tokenizer and filters so they will retain the $$$ chars, indicating the : gap request? if you add multiple Fields with the same name, Lucene will call the "getPositionIncrementGap" method on your Analyzer to determine how much positionIncreiment to put in between the last token of the first Field and the first token of the second Field -- so you could just pass each paragraph as a seperate Field instance .. alternately you can have a single Field instance, and your Analyzer can use whatever mechanims it wants to decide to set the position incriment to something high (a line break, a magic char sequence you put in the string, ... whatever you want) : 3. What is the recommended value to pass setPositionIncrement to create : a reasonable gap, and not risk large documents being indexed improperly : (I mean, is there some sort of high-bound for the position value?). MAX_INT .. pick gaps based on your data and the queries you expect (if you want gaps betwen paragraps, and your paragraphs tend to be under 200 words long, make the gap 500 so "lucene java"~300 can find those words in the same paragram, but can never span multiple paragraphs : 4. What are the consequences of setting PositionIncrement to 0? Does : this mean I can index synonyms or stems aside of the "real" words : without risking data corruption? it means the words appear at the same position - synonyms is a great example of this use case. -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]