On Wednesday 22 February 2006 00:45, Rajesh Munavalli wrote:
> I am trying to adopt lucene for a special IR system. The following scenario
> is an approximation of what I am trying to do. Please bear with me if some
> things doesnt make sense. I need some suggestions on formulating queries for
> the following scenario
>
> Each document consists of a set of fields (standard in lucene). But in my
> case, the field is somewhat different as explained below.
>
> Field:
> ---------
> Each field consists of a set of conceptual sections. Each of these sections
> is separated by say N (say 1000) index positions but are in the same field.
> Sizes of sections vary and do not have any lower or upper bound on the
> number of terms they may contain.
>
> Ex: Lets say Field "contents" has
> <section1 of 100 terms><gap of 1000 term positions><section 2 of 1500
> terms><gap of 1000 term positions><gap of 1000 term positions><section 3 of
> 10 terms>
>
> NOTE: At index time, I am assuming I somehow know how to form these
> sections.
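
The position layout described in the quoted text can be sketched as follows. This is only an illustration in plain Python, not Lucene code: the function name `assign_positions` and the constant `SECTION_GAP` are hypothetical, and the sketch assumes the sections are already known at index time, as stated above.

```python
SECTION_GAP = 1000  # the "N" from the description; hypothetical constant


def assign_positions(sections, gap=SECTION_GAP):
    """Return (term, position, section_index) triples for one field.

    Terms inside a section get consecutive positions; `gap` unused
    positions are skipped before every section after the first.
    """
    out = []
    pos = 0
    for index, section in enumerate(sections):
        if index > 0:
            pos += gap  # leave a hole between sections
        for term in section:
            out.append((term, pos, index))
            pos += 1
    return out


layout = assign_positions([["a", "b", "c"], ["d", "e"], ["f"]])
for term, pos, section in layout:
    print(term, pos, section)
```

With this layout, the last term of one section and the first term of the next are more than `gap` positions apart, so a proximity query whose allowed distance is smaller than the gap cannot match across a section boundary.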
One more choice you have is to index both the full document and each
section as a Lucene document.

>
> Typical Query:
> ---------------------
> Consists of 15 to 30 query terms. In other words, these query terms
> represent a conceptual section.

Would you need synonyms of these terms, too?

> Aim of the Query formation:
> ----------------------------------------
> I want to rank the documents proportional to the number query terms

For this there is the coord() factor used in Lucene boolean queries.
But scoring exactly proportional to the number of query terms is
difficult to do because the Lucene score is not bounded by default.

> appearing in the SAME SECTION and IN ORDER. Documents containing terms with

To query the exact order, you can use PhraseQuery and SpanQuery.

> the
>
> My Questions:
> ---------------------
> Considering the structure of the fields/documents and the number of query
> terms.
>
> (1) Is there an effective way of formulating a query with the existing query
> types in Lucene?

I don't think so, see below.

> (2) After considering the way different queries work and their limitations,
> I think forming phrase/span queries of groups of query terms
> might approximate the rankings I am expecting. In that case which of the
> following queries will perform better (in terms of QUERY SPEED and RANKING)
> (a) phrase query with certain slope factor
> (b) span query

SpanQuery is slower than PhraseQuery, but it has the advantage that it
can be nested. Nesting here means the possibility to use e.g. a short
phrase as a unit to be matched and scored.

Concerning this:

>Rank 2: Documents containing section containing all terms but randomly
>ordered

SpanQuery can also match unordered occurrences; I don't know about
PhraseQuery.

To formulate a single query for your requirements, there is still the
problem that PhraseQuery and SpanQuery only work when all their "terms"
are present in an indexed Lucene document field.
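
To make that limitation concrete, here is a rough sketch in plain Python of ordered and unordered proximity matching. It is not the Lucene implementation: the helper names `index_positions` and `span_near_matches` are hypothetical, the slop rule (span width minus number of terms) is a simplification, and the brute-force search is only meant for tiny examples.

```python
from itertools import product


def index_positions(tokens):
    """Map each term to the list of positions where it occurs."""
    positions = {}
    for pos, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(pos)
    return positions


def span_near_matches(positions, terms, slop, in_order):
    """True if ALL terms occur in one span with at most `slop` other
    positions inside it; optionally require the query order."""
    # A single missing term means no match, however many others occur.
    if any(not positions.get(t) for t in terms):
        return False
    for combo in product(*(positions[t] for t in terms)):
        if len(set(combo)) != len(combo):
            continue  # two query terms cannot share one position
        if in_order and any(a >= b for a, b in zip(combo, combo[1:])):
            continue  # positions do not follow the query order
        width = max(combo) - min(combo) + 1
        if width - len(terms) <= slop:
            return True
    return False


doc = index_positions("the quick brown fox jumps".split())
print(span_near_matches(doc, ["quick", "fox"], slop=1, in_order=True))
print(span_near_matches(doc, ["fox", "quick"], slop=1, in_order=True))
print(span_near_matches(doc, ["fox", "quick"], slop=1, in_order=False))
print(span_near_matches(doc, ["quick", "fox", "dog"], slop=10, in_order=False))
```

The last call fails even with a generous slop, because "dog" does not occur at all: the order of the terms that are present never gets a chance to contribute to the match.
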
Putting it differently, when fewer terms are present, their order cannot
be taken into account, unless the query contains an ordered or unordered
query specifying a subset of the terms present in the documents.

An alternative to the current span query implementation is here:
http://issues.apache.org/jira/browse/LUCENE-413
but this will only help to get an impression of how to match in the
ordered and unordered cases. It might be possible to generalize the
various span algorithms there and in the trunk to work with fewer
"terms".

Regards,
Paul Elschot