I am trying to adopt lucene for a special IR system. The following scenario is an approximation of what I am trying to do. Please bear with me if some things doesnt make sense. I need some suggestions on formulating queries for the following scenario
Each document consists of a set of fields (standard in lucene). But in my case, the field is somewhat different as explained below. Field: --------- Each field consists of a set of conceptual sections. Each of these sections is separated by say N (say 1000) index positions but are in the same field. Sizes of sections vary and do not have any lower or upper bound on the number of terms they may contain . Ex: Lets say Field "contents" has <section1 of 100 terms><gap of 1000 term positions><section 2 of 1500 terms><gap of 1000 term positions><gap of 1000 term positions><section 3 of 10 terms> NOTE: At index time, I am assuming I somehow know how to form these sections. Typical Query: --------------------- Consists of 15 to 30 query terms. In other words, these query terms represent a conceptual section. Aim of the Query formation: ---------------------------------------- I want to rank the documents proportional to the number query terms appearing in the SAME SECTION and IN ORDER. Documents containing terms with the My Questions: --------------------- Considering the structure of the fields/documents and the number of query terms. (1) Is there an effective way of formulating a query with the existing query types in Lucene? (2) After considering the way different queries work and their limitations, I think forming phrase/span queries of groups of query terms might approximate the rankings I am expecting. In that case which of the following queries will perform better (in terms of QUERY SPEED and RANKING) (a) phrase query with certain slope factor (b) span query Thanks, Rajesh Munavalli