Re: Beyond Lucene 2.0 Index Design

Ming Lei Wed, 10 Jan 2007 15:15:02 -0800

A small correction:
Impact does not have to be per occurrence of a term;
more likely it is per aggregation of all occurrences
of a term in a document (that is, per term/doc pair).
You aggregate the significance of the occurrences in
the different regions of a doc at index time and store
the aggregated significance as the "impact". Then you
can do away with fields in a vector-space model of
retrieval.
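
A minimal sketch of that index-time aggregation, in Python. The region weights reuse the 1.0 base / +0.5 title example from the quoted message below. Summing the per-occurrence values is an assumption (any other combining function would do), and `REGION_WEIGHT` and `build_impacts` are illustrative names, not Lucene APIs.

```python
from collections import defaultdict

# Hypothetical significance of a single occurrence in each region;
# the values follow the 1.0 base / +0.5 title example quoted below.
REGION_WEIGHT = {"title": 1.5, "body": 1.0}


def build_impacts(docs):
    """Return {term: {doc_id: impact}}, one impact per (term, doc) pair.

    docs maps doc_id -> {region_name: text}. Summation is an assumed
    aggregation; the thread does not prescribe a particular one.
    """
    impacts = defaultdict(lambda: defaultdict(float))
    for doc_id, regions in docs.items():
        for region, text in regions.items():
            weight = REGION_WEIGHT[region]
            for term in text.lower().split():
                # Fold every occurrence, whatever its region, into one value.
                impacts[term][doc_id] += weight
    return impacts


docs = {
    1: {"title": "lucene index", "body": "the lucene index format"},
    2: {"title": "search", "body": "lucene search"},
}
impacts = build_impacts(docs)
```

Here "lucene" ends up with an impact of 2.5 in doc 1 (one title plus one body occurrence) and 1.0 in doc 2, and the index no longer needs to record which region each occurrence came from.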


But fields usually carry semantics of their own in a
boolean model and in semi-structured information
retrieval, and those you cannot get rid of.
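
To illustrate why, here is a small Python sketch using the subject/content email case raised further down the thread. The data and the helper names (`field_postings`, `merged`) are made up for illustration. It only shows that a field-restricted boolean query and a query over field-less postings can return different documents.

```python
def field_postings(docs):
    """Return {(field, term): set(doc_ids)}, keeping the field with the term."""
    postings = {}
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                postings.setdefault((field, term), set()).add(doc_id)
    return postings


emails = {
    1: {"subject": "index design", "content": "notes on merge policy"},
    2: {"subject": "merge policy", "content": "thoughts on index design"},
}
postings = field_postings(emails)


def merged(term):
    """Doc ids for a term with the field dropped, as one merged list would hold."""
    return set().union(*(ids for (_, t), ids in postings.items() if t == term))


# Boolean query restricted to one field: subject:index AND subject:design
subject_hits = postings[("subject", "index")] & postings[("subject", "design")]

# The same two terms against field-less postings
any_field_hits = merged("index") & merged("design")
```

The subject-only query matches just email 1, while the field-less version also matches email 2. That distinction is lost once all regions are folded into a single impact.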

Michael


--- Ming Lei <[EMAIL PROTECTED]> wrote:

> Just my two cents,
> I think what he meant by "single field" is the
> following:
> 
> If the concept of "field" was introduced to
> differentiate the significance of term occurrences
> in different regions of a document (e.g., an
> occurrence in the title is more important than one in
> the body), that significance can alternatively be
> represented by (or at least encoded in) "impact",
> which is per occurrence of a term.
> For example, if the base "impact" value for a term
> occurrence is 1.0, you can assign an additional 0.5
> to an occurrence in the title. You would then have an
> impact of 1.5 for the title occurrence of that term
> and 1.0 for the body occurrence.
> 
> Does this make sense to you?
> 
> I feel that people should take a look at the
> theoretical retrieval model first. It is not clear to
> me that Lucene is a fully vector-space model. It seems
> that many assumptions have been made about the
> context of the discussion.
> 
> Michael
> 
> --- jian chen <[EMAIL PROTECTED]> wrote:
> 
> > Hi, Jeff,
> > 
> > I like the idea of impact-based scoring. However,
> > could you elaborate on why we only need to use a
> > single field at search time?
> > 
> > In Lucene, the indexed terms are field-specific:
> > two terms, even if they have the same text, are
> > still different terms if they belong to different
> > fields.
> > 
> > So I think the multiple-field scenario is still
> > needed, right? What if the user wants to search on
> > both subject and content for emails, for example,
> > and sometimes only on subject? How would this type
> > of task be handled without multiple fields?
> > 
> > I got lost on this. Could anyone enlighten me?
> > 
> > Thanks,
> > 
> > Jian
> > 
> > On 1/9/07, Dalton, Jeffery <[EMAIL PROTECTED]> wrote:
> > >
> > > I'm not sure we fully understand one another, but
> > > I'll try to explain what I am thinking.
> > >
> > > Yes, it has use after sorting.  It is used at query
> > > time for document scoring in place of the TF and
> > > length norm components (new scorers would need to
> > > be created).
> > >
> > > Using an impact-based index moves most of the
> > > scoring from query time to index time (it trades
> > > query-time flexibility for greatly improved search
> > > performance).  Because the field boosts, length
> > > norm, position boosts, etc. are incorporated into
> > > a single document-term score, you can use a single
> > > field at search time.  It allows one posting list
> > > per query term instead of the current one posting
> > > list per field per query term
> > > (MultiFieldQueryParser wouldn't be necessary in
> > > most cases).  In addition to having fewer posting
> > > lists to examine, you often don't need to read to
> > > the end of long posting lists when processing with
> > > a score-at-a-time approach (see Anh/Moffat's
> > > "Pruned Query Evaluation Using Pre-Computed
> > > Impacts", SIGIR 2006, for details on one potential
> > > algorithm).
> > >
> > > I'm not quite sure what you mean when you mention
> > > leaving them out and re-calculating them at merge
> > > time.
> > >
> > > - Jeff
> > >
> > > > -----Original Message-----
> > > > From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
> > > > Sent: Tuesday, January 09, 2007 2:58 PM
> > > > To: [email protected]
> > > > Subject: Re: Beyond Lucene 2.0 Index Design
> > > >
> > > >
> > > > On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
> > > >
> > > > > e. <impact, num_docs, (doc1,...docN)>
> > > > > f. <impact, num_docs, ([doc1, freq, <positions>],...[docN, freq, <positions>])>
> > > >
> > > > Does the impact have any use after it's used to
> > > > sort the postings?  Can we leave it out of the
> > > > index format and recalculate at merge-time?
> > > >
> > > > Marvin Humphrey
> > > > Rectangular Research
> > > > http://www.rectangular.com/
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
