On Friday 05 December 2003 18:26, Doug Cutting wrote: ... > > Speaking of sentence/paragraph boundaries: Following is a proposal I > just wrote. It also includes the PhraseQuery change I alluded to in my > message yesterday to Erik. If this proposal is accepted, then I'll have > funding to develop it ASAP. > > Does this sound like a reasonable approach to the problem? > > It adds no costs for folks who don't use the new features. When folks > do use it, their index gets larger, but sentence- and paragraph- and > section-type searches are exact and as efficient as ordinary phrase > searches. > > Using position-increment is still possible: one can, e.g., increment by > 100 for sentence, 1000 for paragraph and 10,000 for section. This does > not increase index size nearly as much, but searches are not exact, and > are slightly slower, since they must specify non-zero slop (slop=0 > queries use a faster algorithm). > > Which approach is preferred? Would it be confusing to support both?
I'd prefer to have support for both. To avoid confusion the tradeoffs (time/space) could be mentioned in the API docs. The main advantage is that it allows for configurable distance search at the speed of current phrase search. For me, that is definitely worth adding. > > Note that the relative term position in PhraseQuery feature is not > dependent on the levels feature, and could be easily added right now. > > I hope this is not too confusing... Not at all. The only thing that I don't understand good enough is how the search for normal phrases (with and without slop and truncation) is done, but I guess there is only one way to find out: the source code and some experimentation. > Doug > > ------ > > MATCH WITHIN SENTENCE, PARAGRAPH, SECTION, ETC. > > The token class has a field called "type", which is used by analyzers > to distinguish lexical types, but is currently ignored when indexing. > We change this so that token types can name "levels". Each token type > can be declared to define a unique level. There is also a default > level for all undeclared token types. When indexing a token stream, a > position counter is maintained for each level. Tokens at non-default > levels are not indexed, and simply increment their counter. Tokens at > the default level are both indexed and increment the default counter. > > Consider the following example. > > Types "sent" and "para" are declared as levels. Types "word" and > "abbr" are undeclared, and hence default. > > The first two lines below are a sample sequence of token texts and > types. The next two lines are the values of the counters when > indexing this sequence of tokens. > > Tokens: > > text: The start . The USA . The end . > type: word word sent word abbr para sent word word sent > > Counters: > > default: 0 1 1 2 3 3 3 4 5 5 > sent: 0 0 0 1 1 1 1 2 2 2 > para: 0 0 0 0 0 0 1 1 1 1 > > Thus the word "the" appears at default level positions 0, 2, and 4, > and at "sent" level positions 0, 1, and 2, and at para level positions > 0, 0, and 1. I'm trying to let this get through to me: every token would have a one or more positions at each level. That's quite general, and could be very powerful. > > The following method is used to declare a level: > > IndexWriter.addLevel(String field, String type); As the type of level is not a number, am I correct to assume that the levels are not necessarily hierarchical? Eg. in the example would it be possible that the sentence nr is not increased but the paragraph nr is increased at a token? I would certainly not mind having the possibities that this non strict hierarchy opens, eg. addition of more linguistic information in the index. A level could function as an attribute for a token, eg. noun, verb etc.. However, this might also require a proximity search over different levels. I'd like to hear some comment from a linguist on this, as it might help to get better results in searching for longer phrases in which the level information would be provided by a document analyzer and a query parser. > > Note that this must be performed each time an IndexWriter is > constructed. If it is performed when some documents are added but not > when others are, or if indexes are merged with different positioned > tokens, then all positions will be zero for levels undeclared when > indexing. > > An index maintains, for each field, the known levels, accessed through: > > public String[] IndexReader.positionedLevels(String field); > > The stream of positions at each level is accessed with: > > public TermPositions IndexReader.getPositions(Term term, String level); > > Note that the index size increases substantially for each level added. > Each additional level adds around 50% to the default-only index size. > Thus an index with two additional levels would require around 200% the > size of an index with no additional levels. > > PhraseQuery's constructor takes a level for the match: > > public PhraseQuery(String level); > > Phrase queries are also permitted to explicitly specify the position > of each term in the phrase. For example, if two phrase query terms > are both at position=0, with phrase slop=0, then they must occur at > exactly the same positions. And if they're at position=0 and > position=1 respectively, with phrase slop=0, then they must occur > adjacently, as an exact phrase. > > public PhraseQuery.add(Term term, int position); > > If, for example, sentence boundaries are at level=sent, then a phrase > query at level=sent with slop=0 and position=0 for all terms would > require that all query terms are in the same sentence. Would there be a way to relate the scores of two phrase queries done at different levels? Eg. for a query that requires two terms to be in the same sentence, one could use half of the avarage sentence length as the edit distance in sloppyFreq(), but this would require storing the nr of sentences, ie. sth like the sentence norm. I guess this question boils down to whether a norm would be stored for each level. Also, for non strict hierarchical levels, the norm would not be a trivial thing. And how about scoring for interlevel search, if that would be possible? Interlevel search might also require that all levels for a document are stored close to each other in the proximity index. From your earlier remark: > It adds no costs for folks who don't use the new features. When folks > do use it, their index gets larger, but sentence- and paragraph- and > section-type searches are exact and as efficient as ordinary phrase > searches. I got the impression that each level would be stored separately. However, for levels that are more attribute like (eg. noun/verb) it might be better to give them the same norm, and store them close to each other in order to allow interlevel search. Anyway, I think this multilevel indexing is a great idea even for strict hierarchical levels, but these don't need a name, a level number (or even only the number of levels) will do. Kind regards, Ype Kingma --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]