>   Document doc = new Document();
>   for (int i = 0; i < pages.length; i++) {
>     doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
>     doc.add(new Field("text", "$$", Field.Store.NO, Field.Index.UN_TOKENIZED));
>   }

UN_TOKENIZED. Nice idea!
I will check this out.
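
If I understand the idea correctly, a query that must stay inside one page
would then exclude every span containing the boundary token, something like
this (just a sketch; "foo", "bar" and the 100000 slop are made up):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanNotQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  // "foo" and "bar" anywhere on the same page: allow a huge slop, but
  // drop every match whose span contains the "$$" page marker
  SpanQuery foo = new SpanTermQuery(new Term("text", "foo"));
  SpanQuery bar = new SpanTermQuery(new Term("text", "bar"));
  SpanQuery near = new SpanNearQuery(new SpanQuery[] { foo, bar }, 100000, false);
  SpanQuery samePage = new SpanNotQuery(near, new SpanTermQuery(new Term("text", "$$")));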

> 2) if your goal is just to be able to make sure you can query for
> phrases without crossing page boundaries, it's a lot simpler just to
> use a really big positionIncrementGap with your analyzer (and add each
> page as a separate Field instance).  Boundary tokens like these are
> really only necessary if you want more complex queries (like "find X
> and Y on the same page but not in the same sentence").

Hm. This is what Erik already recommended.
I would have to store the field with TermVector.WITH_POSITIONS, right?
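
For the record, this is roughly how I read the gap approach (only a sketch;
the 40000 gap is an assumption, and pages / writer stand for my page texts
and an IndexWriter opened with this analyzer):

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  final Analyzer delegate = new StandardAnalyzer();
  Analyzer pageAnalyzer = new Analyzer() {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      return delegate.tokenStream(fieldName, reader);
    }
    public int getPositionIncrementGap(String fieldName) {
      return 40000;  // assumed gap; must exceed the longest page's term count
    }
  };

  Document doc = new Document();
  for (int i = 0; i < pages.length; i++) {
    // each page as its own Field instance; the analyzer inserts the
    // gap between the instances of the "text" field
    doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
  }
  writer.addDocument(doc);  // writer was opened with pageAnalyzer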

But I do not know the maximum number of terms per page, and I do not know
the maximum number of pages.
I have already had documents with more than 50,000 pages (A4) and documents
with 1 page but 100 MB of data.
How many terms can 100 MB hold? Hm...
Since positions are stored as int, I could have at most about 40,000 terms
per page (50,000 pages * 40,000 terms -> nearly Integer.MAX_VALUE).
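
Put as a back-of-the-envelope calculation:

  // positions are ints, so the last position in the field has to stay
  // below Integer.MAX_VALUE
  int maxPages = 50000;                                 // worst case so far
  int positionsPerPage = Integer.MAX_VALUE / maxPages;  // = 42949
  // gap + terms per page must fit into that budget, hence the ~40,000 above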


