Re: Index without tf, anyone?

eks dev Fri, 18 Jul 2008 13:38:32 -0700

You are right here, and I have already changed my mind on this one: almost all
of my terms have tf = 1 anyway, so it would not make sense.


But I will hard-code tf to 1 in that case, since it does no harm and makes the
tf = 0 problem go away.



----- Original Message ----
> From: Michael McCandless <[EMAIL PROTECTED]>
> To: java-dev@lucene.apache.org
> Sent: Friday, 18 July, 2008 10:19:19 PM
> Subject: Re: Index without tf, anyone?
> 
> 
> You could do that, though, it's not as optimized because you'd use  
> fewer bytes if you directly encoded the docDelta (not 2*docDelta+1),  
> and you'd save some CPU when decoding as well.  But maybe first do it  
> this way, then if necessary/it helps/etc, explore the optimization?
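
To make the byte saving concrete, here is a minimal sketch (not Lucene code) that compares how many VInt bytes a docDelta needs when it is written as 2*docDelta+1 and when it is written bare. The class and method names are made up for illustration; the only assumption is the usual VInt layout of 7 payload bits per byte.

```java
public class DocDeltaSize {
    // Number of bytes a non-negative int occupies as a VInt
    // (7 payload bits per byte, high bit marks continuation).
    public static int vIntLength(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    // Layout with tf == 1 folded into the low bit: 2*docDelta + 1.
    public static int withTfBit(int docDelta) {
        return vIntLength((docDelta << 1) | 1);
    }

    // Hypothetical tf-free layout: the bare docDelta.
    public static int withoutTf(int docDelta) {
        return vIntLength(docDelta);
    }

    public static void main(String[] args) {
        int[] deltas = {1, 63, 64, 127, 8191, 8192};
        for (int d : deltas) {
            System.out.println("docDelta=" + d
                + " with tf bit: " + withTfBit(d) + " byte(s)"
                + ", bare: " + withoutTf(d) + " byte(s)");
        }
    }
}
```

The saving only shows up when the extra bit pushes the value over a 7-bit boundary (for example docDelta 64 to 127), so how much it helps depends on how dense the term is.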
> 
> Mike
> 
> eks dev wrote:
> 
> > Am I boring you? :)
> >
> > Would it be OK to always assume tf == 1 when omitTf is used? In that
> > case the encoded value (2*docDelta+1) is always odd, and the current
> > index format interprets this as tf == 1. If all terms have tf == 1,
> > tf contributes the same factor to every score, so it makes no
> > difference to the relative ranking.
> >
> >
> > In that case, there is no need to change anything on the reader side!
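
A minimal sketch of that idea, assuming the doc/freq layout discussed in this thread (odd code means tf == 1, even code means a separate freq VInt follows). The class is self-contained and hypothetical; it is not the Lucene reader or writer, it only imitates their decoding rule.

```java
import java.io.ByteArrayOutputStream;

public class OddDocDeltaDemo {
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static int readVInt(byte[] data, int[] pos) {
        byte b = data[pos[0]++];
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos[0]++];
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    // Hypothetical omitTf writer: always emits the odd code, never a freq.
    public static byte[] writeOmitTf(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int doc : sortedDocIds) {
            writeVInt(out, ((doc - last) << 1) | 1);
            last = doc;
        }
        return out.toByteArray();
    }

    // Unchanged reader rule: odd code => freq 1, even code => read freq.
    // Returns {doc, freq} pairs.
    public static int[][] read(byte[] data, int count) {
        int[][] postings = new int[count][2];
        int[] pos = {0};
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int code = readVInt(data, pos);
            doc += code >>> 1;
            int freq = (code & 1) != 0 ? 1 : readVInt(data, pos);
            postings[i][0] = doc;
            postings[i][1] = freq;
        }
        return postings;
    }

    public static void main(String[] args) {
        int[][] postings = read(writeOmitTf(new int[] {3, 10, 500}), 3);
        for (int[] p : postings) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
    }
}
```

The reader never sees an even code, so it never tries to read a freq that was not written.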
> >
> >
> > ----- Original Message ----
> >> From: eks dev 
> >> To: java-dev@lucene.apache.org
> >> Sent: Friday, 18 July, 2008 9:48:04 PM
> >> Subject: Re: Index without tf, anyone?
> >>
> >> Also, another one:
> >>
> >> What should happen with payloads and the omitTf option in the case
> >> storePayloads==true && omitTf==true?
> >> Should we:
> >> 1. ignore omitTf and go on with payloads
> >> or
> >> 2. disable payloads and omit tf
> >>
> >> The other combinations are clear.
> >>
> >>
> >>
> >> ----- Original Message ----
> >>> From: eks dev
> >>> To: java-dev@lucene.apache.org
> >>> Sent: Friday, 18 July, 2008 9:20:09 PM
> >>> Subject: Re: Index without tf, anyone?
> >>>
> >>> Mike,
> >>> I have started playing with this, holy cow... it is a lot of code.
> >>>
> >>> Question:
> >>>
> >>> In SegmentMerger.mergeFields() there is a big block:
> >>>
> >>> else {
> >>>   addIndexed(reader, fieldInfos,
> >>>       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET),
> >>>       true, true, true, false);
> >>>   addIndexed(reader, fieldInfos,
> >>>       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION),
> >>>       true, true, false, false);
> >>>   addIndexed(reader, fieldInfos,
> >>>       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET),
> >>>       true, false, true, false);
> >>>   addIndexed(reader, fieldInfos,
> >>>       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR),
> >>>       true, false, false, false);
> >>>   addIndexed(reader, fieldInfos,
> >>>       reader.getFieldNames(IndexReader.FieldOption.STORES_PAYLOADS),
> >>>       false, false, false, true);
> >>>   addIndexed(reader, fieldInfos,
> >>>       reader.getFieldNames(IndexReader.FieldOption.INDEXED),
> >>>       false, false, false, false);
> >>>   fieldInfos.add(reader.getFieldNames(IndexReader.FieldOption.UNINDEXED),
> >>>       false);
> >>> }
> >>>
> >>>
> >>> I simply do not understand it. I have changed the addIndexed(...)
> >>> signature to include omitTf, but I am not sure what needs to be done
> >>> here.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ----- Original Message ----
> >>>> From: Michael McCandless
> >>>> To: java-dev@lucene.apache.org
> >>>> Sent: Friday, 18 July, 2008 11:48:20 AM
> >>>> Subject: Re: Index without tf, anyone?
> >>>>
> >>>> I just committed LUCENE-1301, which is a first step (top down) towards
> >>>> flexible indexing.  I hope I didn't break anything....
> >>>>
> >>>> While flexible indexing should make this simpler, it's not too bad to
> >>>> modify Lucene to do this today, if you want.  I think this is what
> >>>> you'll need to do (but I haven't tested!):
> >>>>
> >>>>   * Add something to Fieldable/AbstractField/Field that "knows"
> >>>>     whether a field should store the tf.  Also add this to
> >>>>     FieldInfo.java, and make sure that bit is saved to the fnm file.
> >>>>
> >>>>   * In the new oal.index.DocFieldProcessorPerThread, in the
> >>>>     processDocument method, fix the FieldInfos.add call to also pass
> >>>>     in your new "storeTermFreq" bit.  Probably, assert that this
> >>>>     cannot change -- ie a field must be created with
> >>>>     storeTermFreq=true or false and must never change.
> >>>>
> >>>>   * The new oal.index.FreqProxTermsWriter, in appendPostings, has the
> >>>>     code that creates a new segment.  Change that to skip writing tf
> >>>>     if the FieldInfo says so.
> >>>>
> >>>>   * Fix SegmentTermDocs to not read tf if FieldInfo says so.
> >>>>
> >>>>   * Fix SegmentMerger.appendPostings to not merge/write tf if
> >>>>     FieldInfo says so.  Likewise assert here that the "storeTermFreq"
> >>>>     does not change in the merged segments.
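
The first two steps of that list (a per-field bit that is persisted and must never change) could be sketched roughly as below. Everything here is hypothetical: the class name, the bit value 0x40 and the default of storing tf are assumptions for illustration, not the real FieldInfo/FieldInfos code or the real fnm layout.

```java
import java.util.HashMap;
import java.util.Map;

public class TermFreqFlagSketch {
    // Hypothetical bit position; the real fnm flag byte is not reproduced here.
    public static final byte STORE_TERM_FREQ = 0x40;

    private final Map<String, Boolean> storeTermFreqByField = new HashMap<>();

    // Registers a field and rejects any later attempt to flip the setting.
    public void add(String field, boolean storeTermFreq) {
        Boolean previous = storeTermFreqByField.putIfAbsent(field, storeTermFreq);
        if (previous != null && previous.booleanValue() != storeTermFreq) {
            throw new IllegalStateException(
                "storeTermFreq cannot change for field '" + field + "'");
        }
    }

    // Flag byte that would be written for this field (assumed default: tf stored).
    public byte toBits(String field) {
        boolean store = storeTermFreqByField.getOrDefault(field, Boolean.TRUE);
        return (byte) (store ? STORE_TERM_FREQ : 0);
    }

    // Decodes the flag when the byte is read back.
    public static boolean storesTermFreq(byte bits) {
        return (bits & STORE_TERM_FREQ) != 0;
    }

    public static void main(String[] args) {
        TermFreqFlagSketch infos = new TermFreqFlagSketch();
        infos.add("id", false);
        infos.add("body", true);
        System.out.println("id stores tf: " + storesTermFreq(infos.toBits("id")));
        System.out.println("body stores tf: " + storesTermFreq(infos.toBits("body")));
    }
}
```

The writer, reader and merger steps would then branch on this one bit instead of each keeping its own notion of whether tf is present.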
> >>>>
> >>>> It's also possible to fix FreqProxTermsWriterPerField to not even
> >>>> compute & store the tf in its RawPostingList, per term.  This is an
> >>>> optimization (saves RAM & CPU) that you can do after first getting the
> >>>> above working...
> >>>>
> >>>> On the search side, you'll need to fix scoring to be OK with tf=0.
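
Why tf=0 matters on the search side: with a square-root tf factor (the shape of Lucene's DefaultSimilarity.tf), a frequency of 0 multiplies the whole term score down to 0, while a constant 1 leaves it untouched. The sketch below is a simplified stand-in, not the real scorer; termScore and its idf/norm parameters are assumptions for illustration.

```java
public class TfFactorSketch {
    // Square-root tf factor, mirroring the shape of DefaultSimilarity.tf.
    public static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Simplified per-term score: tf factor times the remaining weight.
    public static float termScore(float freq, float idf, float norm) {
        return tf(freq) * idf * norm;
    }

    public static void main(String[] args) {
        System.out.println("freq=0: " + termScore(0f, 2f, 1f));
        System.out.println("freq=1: " + termScore(1f, 2f, 1f));
    }
}
```

So a reader that reports freq=0 for tf-less fields would need a scoring fix, whereas one that reports a constant 1 would not.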
> >>>>
> >>>> I think this would be a useful addition to Lucene (it comes up every
> >>>> so often), even before we fully work out flexible indexing.
> >>>>
> >>>> Mike
> >>>>
> >>>> eks dev wrote:
> >>>>
> >>>>> Hi all,
> >>>>> is there any way to have pure postings lists without
> >>>>> interleaved tf? In our case it eats a lot of CPU for VInt decoding
> >>>>> on dense terms (and also doubles IO). An untested patch, tips on how
> >>>>> to do it, or anything else would help. I know about flexible
> >>>>> indexing, but cannot wait (I guess it will take some time?).
> >>>>>
> >>>>> Does it make sense to start working on it? Can this somehow be
> >>>>> incorporated into Flexible Indexing later? I would hate to do it and
> >>>>> then throw it away when Mike does his magic with Flexible Indexing.
> >>>>>
> >>>>> We are simply sure this could help performance a lot (some dense
> >>>>> fields always have constant tf, so there is no need to read it from
> >>>>> the index). I am just asking in case somebody happens to have a
> >>>>> quick 'n' dirty solution or idea.
> >>>>>
> >>>>> Thanks, eks
> >>>>>
> >>>>>
> >>>>>
> >>>>>     __________________________________________________________
> >>>>> Not happy with your email address?.
> >>>>> Get the one you really want - millions of new email addresses
> >>>>> available now at Yahoo! http://uk.docs.yahoo.com/ymail/new.html
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>>>



