I haven't really been following this thread that closely, but...
: Why not just use $$$$$$$$? Check to ensure that it makes
: it through whatever analyzer you choose though. For instance,
: LetterTokenizer will remove it...
1) i'm 99% sure you can do something like this...
Document doc = new Document();
for (int i = 0; i < pages.length; i++) {
  // the page text goes through your analyzer as usual
  doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
  // the boundary marker is UN_TOKENIZED, so it bypasses the
  // analyzer and gets indexed verbatim
  doc.add(new Field("text", "$$", Field.Store.NO, Field.Index.UN_TOKENIZED));
}
...and you'll get your magic token regardless of whether it would normally
make it through your analyzer. In fact, you want it to be something your
analyzer could never produce, even if it appears in the original text, so
you don't get false boundaries (ie: if you use an Analyzer that lowercases
everything, then "A" makes a perfectly fine boundary token).
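One quick way to vet a candidate token (a minimal sketch, assuming the
Lucene 2.x-era API where TokenStream.next() returns a Token) is to feed the
token itself through your analyzer: if nothing comes out the other side,
the analyzer can never produce that token from real text...

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// if this loop prints anything, the analyzer can produce your
// candidate token from document text -- pick a different one
TokenStream ts = new StandardAnalyzer().tokenStream("text",
                                                    new StringReader("$$"));
for (Token t = ts.next(); t != null; t = ts.next()) {
  System.out.println("bad choice, analyzer emits: " + t.termText());
}
ts.close();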
2) if your goal is just to be able to make sure you can query for phrases
without crossing page boundaries, it's a lot simpler to just use a
really big positionIncrementGap with your analyzer (and add each page as a
separate Field instance); see the sketch below. boundary tokens like these
are really only necessary if you want more complex queries (like "find X
and Y on the same page but not in the same sentence")
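A rough sketch of that approach (the PageGapAnalyzer name and the 10000
gap are just my illustrations, not standard Lucene classes)...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class PageGapAnalyzer extends StandardAnalyzer {
  // the gap is inserted between successive Field instances with the
  // same name; any phrase (or sloppy phrase) query with slop smaller
  // than this can never match across two pages
  public int getPositionIncrementGap(String fieldName) {
    return 10000;
  }
}

Document doc = new Document();
for (int i = 0; i < pages.length; i++) {
  // one Field instance per page, all with the same field name
  doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
}

...pass the PageGapAnalyzer to your IndexWriter and the gap is applied
automatically between the page instances, so an ordinary PhraseQuery
can't match across a page break.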
-Hoss