Re: Index without tf, anyone?

Michael McCandless Fri, 18 Jul 2008 13:20:09 -0700

You could do that, though it's not as optimized, because you'd use fewer bytes if you directly encoded the docDelta (not 2*docDelta+1), and you'd save some CPU when decoding as well. But maybe first do it this way, then if necessary/it helps/etc., explore the optimization?
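
To put a number on the "fewer bytes" point, here is a minimal sketch that counts the bytes a value needs as a VInt, assuming the usual layout of 7 payload bits per byte. It is not Lucene code; `DocDeltaSize` and `vIntLength` are names made up for this illustration.

```java
// Illustration only: DocDeltaSize and vIntLength are hypothetical names, not Lucene API.
final class DocDeltaSize {

  // Number of bytes a non-negative int occupies as a VInt
  // (7 payload bits per byte, high bit marks "more bytes follow").
  static int vIntLength(int value) {
    int length = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      length++;
    }
    return length;
  }

  public static void main(String[] args) {
    int docDelta = 100;
    System.out.println("direct docDelta: " + vIntLength(docDelta) + " byte(s)");
    System.out.println("2*docDelta + 1:  " + vIntLength(2 * docDelta + 1) + " byte(s)");
  }
}
```

With docDelta = 100 the direct form fits in one byte, while 2*100+1 = 201 no longer fits in 7 bits and spills into a second byte. The saving only shows up for deltas in ranges like this one, which is probably why it can wait until the simple version works.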


Mike

eks dev wrote:

am I boring :)

would it be ok to assume tf == 1 always if we use omitTf? In that case docDelta remains odd and the current index format interprets this as tf==1... if all terms have tf == 1, the relative score is factored out, so it makes no difference.



In that case, there is no need to change anything on reader side!
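
A small sketch of why the reader side could stay untouched, assuming the convention described in this thread: an odd doc code (2*docDelta+1) means tf == 1, an even one means an explicit tf follows. The class and method names are hypothetical, and plain ints stand in for the VInt stream.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names modelling the convention discussed in this thread.
final class OmitTfCompat {

  // Writer with omitTf: always emit an odd code, i.e. 2*docDelta + 1.
  static int[] encodeOmitTf(int[] docIds) {
    int[] codes = new int[docIds.length];
    int lastDoc = 0;
    for (int i = 0; i < docIds.length; i++) {
      codes[i] = ((docIds[i] - lastDoc) << 1) | 1;
      lastDoc = docIds[i];
    }
    return codes;
  }

  // Unmodified reader: odd code means tf == 1, even code means tf follows.
  // Returns {doc, tf} pairs.
  static List<int[]> decode(int[] stream) {
    List<int[]> postings = new ArrayList<>();
    int doc = 0;
    int i = 0;
    while (i < stream.length) {
      int code = stream[i++];
      doc += code >>> 1;
      int tf = (code & 1) != 0 ? 1 : stream[i++];
      postings.add(new int[] {doc, tf});
    }
    return postings;
  }

  public static void main(String[] args) {
    for (int[] p : decode(encodeOmitTf(new int[] {3, 7, 8}))) {
      System.out.println("doc=" + p[0] + " tf=" + p[1]);
    }
  }
}
```

The decoder never reaches the "tf follows" branch for such a stream, so every posting comes back with tf == 1 without any reader change.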


----- Original Message ----

From: eks dev <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Friday, 18 July, 2008 9:48:04 PM
Subject: Re: Index without tf, anyone?

also, another one:

what should happen with the payloads and omitTf options in the case of
storePayloads==true && omitTf==true
should we say:
1. ignore omitTf and go on with payloads
or
2. disable payloads and omit tf

the other combinations are clear

----- Original Message ----

From: eks dev
To: java-dev@lucene.apache.org
Sent: Friday, 18 July, 2008 9:20:09 PM
Subject: Re: Index without tf, anyone?

Mike,
I have started playing with this, holy cow... it is a lot of code.

Question:

In SegmentMerger.mergeFields() there is a big block:

else {
  addIndexed(reader, fieldInfos,
      reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET),
      true, true, true, false);
  addIndexed(reader, fieldInfos,
      reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION),
      true, true, false, false);
  addIndexed(reader, fieldInfos,
      reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET),
      true, false, true, false);
  addIndexed(reader, fieldInfos,
      reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR),
      true, false, false, false);
  addIndexed(reader, fieldInfos,
      reader.getFieldNames(IndexReader.FieldOption.STORES_PAYLOADS),
      false, false, false, true);
  addIndexed(reader, fieldInfos,
      reader.getFieldNames(IndexReader.FieldOption.INDEXED),
      false, false, false, false);
  fieldInfos.add(reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
}
I simply do not understand it. I have changed the addIndexed(...) signature to include omitTf, but I am not sure what needs to be done here?





----- Original Message ----

From: Michael McCandless
To: java-dev@lucene.apache.org
Sent: Friday, 18 July, 2008 11:48:20 AM
Subject: Re: Index without tf, anyone?

I just committed LUCENE-1301, which is a first step (top down) towards
flexible indexing.  I hope I didn't break anything....

While flexible indexing should make this simpler, it's not too bad to
modify Lucene to do this today, if you want.  I think this is what
you'll need to do (but I haven't tested!):

  * Add something to Fieldable/AbstractField/Field that "knows"
    whether a field should store the tf.  Also add this to
    FieldInfo.java, and make sure that bit is saved to the fnm file.

  * In the new oal.index.DocFieldProcessorPerThread, in the
    processDocument method, fix the FieldInfos.add call to also pass
    in your new "storeTermFreq" bit.  Probably, assert that this
    cannot change -- ie a field must be created with
    storeTermFreq=true or false and must never change.

  * The new oal.index.FreqProxTermsWriter, in appendPostings, has the
    code that creates a new segment.  Change that to skip writing tf
    if the FieldInfo says so.

  * Fix SegmentTermDocs to not read tf if FieldInfo says so.

  * Fix SegmentMerger.appendPostings to not merge/write tf if
    FieldInfo says so.  Likewise assert here that the "storeTermFreq"
    does not change in the merged segments.
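
The first two steps above (a per-field bit that is persisted and must never change) can be sketched in isolation as follows. This is only an illustration under assumptions: `FieldInfoSketch`, its methods, and the bit values are hypothetical and do not reflect the actual FieldInfos class or the real fnm layout.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical class and bit values, not the real FieldInfos/fnm format.
final class FieldInfoSketch {
  static final int IS_INDEXED = 0x01;      // assumed bit
  static final int OMIT_TERM_FREQ = 0x40;  // hypothetical bit, assumed to be unused

  private final Map<String, Boolean> storeTermFreqByField = new HashMap<>();

  // Registers a field; its storeTermFreq setting must never change afterwards.
  void add(String name, boolean storeTermFreq) {
    Boolean previous = storeTermFreqByField.putIfAbsent(name, storeTermFreq);
    if (previous != null && previous.booleanValue() != storeTermFreq) {
      throw new IllegalStateException("storeTermFreq changed for field " + name);
    }
  }

  // Flag byte as it might be saved for this field.
  byte flags(String name) {
    int bits = IS_INDEXED;
    if (Boolean.FALSE.equals(storeTermFreqByField.get(name))) {
      bits |= OMIT_TERM_FREQ;
    }
    return (byte) bits;
  }

  // What a reader or merger would check before reading or writing tf.
  static boolean storesTermFreq(byte flags) {
    return (flags & OMIT_TERM_FREQ) == 0;
  }
}
```

Storing the omission as a set bit is a deliberate choice in this sketch: flag bytes written before the bit existed have it clear, so they would still read as "tf is stored".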

It's also possible to fix FreqProxTermsWriterPerField to not even
compute & store the tf in its RawPostingList, per term.  This is an
optimization (saves RAM & CPU) that you can do after first getting the
above working...

On the search side, you'll need to fix scoring to be OK with tf=0.
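
One possible reading of "OK with tf=0", as a hedged sketch: if the similarity uses a square-root tf factor (assumed here) and the reader reports 0 for a field that stores no tf, the term would contribute nothing to the score. The helper below, with hypothetical names, substitutes a constant frequency of 1 for such fields.

```java
// Illustration only: hypothetical names; the sqrt tf factor is an assumption of this sketch.
final class TfScoringSketch {

  // Square-root term-frequency factor.
  static float tf(float freq) {
    return (float) Math.sqrt(freq);
  }

  // Use the stored frequency when the field has one, otherwise a constant 1,
  // so that a field without tf does not zero out the score.
  static float effectiveTf(int freqFromIndex, boolean fieldStoresTermFreq) {
    return tf(fieldStoresTermFreq ? freqFromIndex : 1);
  }

  public static void main(String[] args) {
    System.out.println("tf(0) = " + tf(0));
    System.out.println("effectiveTf(0, false) = " + effectiveTf(0, false));
  }
}
```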

I think this would be a useful addition to Lucene (it comes up every
so often), even before we fully work out flexible indexing.

Mike

eks dev wrote:

hi all,
is there any solution to have pure postings lists without
interleaved tf? This eats a lot of CPU for VInt decoding on dense
terms (also doubles IO...) in our case. It can be an untested patch,
tips on how to do it, or whatever... I know about flexible indexing, but
cannot wait (I guess it will take some time?).
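
The "doubles IO" remark can be illustrated with a rough sketch that writes a dense postings list (every docDelta is 1) as VInts, once with interleaved tf and once with doc deltas only. It assumes the doc-code convention discussed in this thread (2*docDelta+1 for tf == 1, otherwise 2*docDelta followed by tf); the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;

// Illustration only: hypothetical names, byte counting for a dense term.
final class DensePostingsSize {

  // Writes value as a VInt: 7 bits per byte, high bit set when more bytes follow.
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  // Bytes for numDocs consecutive docs, each with the same term frequency tf.
  static int interleavedBytes(int numDocs, int tf) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < numDocs; i++) {
      if (tf == 1) {
        writeVInt(out, (1 << 1) | 1);  // odd code, tf implied
      } else {
        writeVInt(out, 1 << 1);        // even code...
        writeVInt(out, tf);            // ...followed by explicit tf
      }
    }
    return out.size();
  }

  // Bytes for the same docs when only the doc deltas are written.
  static int docOnlyBytes(int numDocs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < numDocs; i++) {
      writeVInt(out, 1);
    }
    return out.size();
  }

  public static void main(String[] args) {
    System.out.println("interleaved, tf=2: " + interleavedBytes(1000, 2) + " bytes");
    System.out.println("doc deltas only:   " + docOnlyBytes(1000) + " bytes");
  }
}
```

For a dense term whose tf is a small constant above 1, each posting costs two single-byte VInts instead of one, which matches the factor of two mentioned above.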

Does it make sense to start working on it? Can this somehow be
incorporated into Flexible Indexing later... I would hate to do it and then
throw it away when Mike does his magic with Flexible Indexing.

We are simply sure this could help performance a lot (some dense
fields always have a constant tf, no need to read them from the index).
Just asking for help in case somebody happens to have some
Quick 'n Dirty solution/idea.

thanks, eks



    __________________________________________________________
Not happy with your email address?.
Get the one you really want - millions of new email addresses
available now at Yahoo! http://uk.docs.yahoo.com/ymail/new.html

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


