I haven't really been following this thread that closely, but...
: Why not just use $$$$$$$$? Check to ensure that it makes
: it through whatever analyzer you choose though. For instance,
: LetterTokenizer will remove it...
1) i'm 99% sure you can do something like this...
Document doc = new Document();
for (int i = 0; i < pages.length; i++) {
  // the page text goes through your analyzer as usual
  doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
  // the boundary marker is UN_TOKENIZED, so it bypasses the
  // analyzer and gets indexed verbatim
  doc.add(new Field("text", "$$", Field.Store.NO, Field.Index.UN_TOKENIZED));
}
...and you'll get your magic token regardless of whether it would normally
make it through your analyzer. In fact, you want it to be something your
analyzer could never produce, even if it appears in the original text, so
you don't get false boundaries (ie: if you use an Analyzer that lowercases
everything, then "A" makes a perfectly fine boundary token).
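One quick way to vet a candidate token (a minimal sketch, assuming the
Lucene 2.x-era API where TokenStream.next() returns a Token) is to feed the
token itself through your analyzer: if nothing comes out the other side,
the analyzer can never produce that token from real text...

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// if this loop prints anything, the analyzer can produce your
// candidate token from document text -- pick a different one
TokenStream ts = new StandardAnalyzer().tokenStream("text",
                                                    new StringReader("$$"));
for (Token t = ts.next(); t != null; t = ts.next()) {
  System.out.println("bad choice, analyzer emits: " + t.termText());
}
ts.close();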
2) if your goal is just to be able to make sure you can query for phrases
without crossing page boundaries, it's a lot simpler to just use a
really big positionIncrementGap with your analyzer (and add each page as a
separate Field instance); see the sketch below. boundary tokens like these
are really only necessary if you want more complex queries (like "find X
and Y on the same page but not in the same sentence")
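A rough sketch of that approach (the PageGapAnalyzer name and the 10000
gap are just my illustrations, not standard Lucene classes)...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class PageGapAnalyzer extends StandardAnalyzer {
  // the gap is inserted between successive Field instances with the
  // same name; any phrase (or sloppy phrase) query with slop smaller
  // than this can never match across two pages
  public int getPositionIncrementGap(String fieldName) {
    return 10000;
  }
}

Document doc = new Document();
for (int i = 0; i < pages.length; i++) {
  // one Field instance per page, all with the same field name
  doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
}

...pass the PageGapAnalyzer to your IndexWriter and the gap is applied
automatically between the page instances, so an ordinary PhraseQuery
can't match across a page break.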
-Hoss