[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Eks Dev (JIRA) Sat, 26 Jul 2008 02:47:54 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617140#action_12617140
 ]


Eks Dev commented on LUCENE-1340:
---------------------------------

We finished our tests.

Index without omitTf():
- 87 million documents, 2 indexed fields, one stored field
- Unique terms in index: 2.5 million
- Average field lengths in tokens: 3.3 and 5.5 (very short fields)
- On-disk size: 3.8 GB total, with stored field
 
Queries under test: 
- BooleanQuery in all shapes and forms (disjunctive, conjunctive, nested, with 
minNumberShouldMatch()), with a lot of clauses (5-100).
- Filter used: yes

Test scope: regression with 30k queries on the same index, once with 
omitTf(true) and once with omitTf(false).
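
A minimal sketch of such a regression check, assuming the hits of each run 
have already been collected as sets of document IDs per query. The class and 
method names are hypothetical, and comparing ID sets rather than scores or 
ranking is an assumption about what "identical hits" means here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: compares per-query hit sets from two search runs. */
class HitRegression {

    /**
     * Returns the IDs of all queries whose hit sets differ between the two runs.
     * A query that is missing from one run counts as a difference.
     */
    static List<String> differingQueries(Map<String, Set<Integer>> baseline,
                                         Map<String, Set<Integer>> candidate) {
        // Union of query IDs, sorted so the report is stable between runs.
        Set<String> allQueries = new TreeSet<>(baseline.keySet());
        allQueries.addAll(candidate.keySet());

        List<String> diffs = new ArrayList<>();
        for (String queryId : allQueries) {
            // Set equality ignores hit order and scores; only doc IDs matter.
            if (!Objects.equals(baseline.get(queryId), candidate.get(queryId))) {
                diffs.add(queryId);
            }
        }
        return diffs;
    }
}
```

An empty result means both runs returned the same documents for every query.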

Result:

- The queries returned 100% identical hits (full recall tested, all hits 
checked)!

- Index size reduction (not including stored field!): 7% (short documents => 
fewer positions than in Mike's case)

- Performance of queries: 5.2% faster, but the index was loaded as RAMIndex (an 
on-disk setup should gain even more, due to the reduced IO for reading postings)

- Indexing performance (FSDisk!): 13% faster

Also, we compared omitTf(false) with this patch against lucene.jar without 
this patch: no changes whatsoever.

From my perspective, this is good to go into production. At least for our 
usage of Lucene, there are no differences with omitTf(true)... 

>One more thing here: since the tiis are loaded into RAM, that unused 
>proxPointer wastes 8 bytes for each indexed term. For indices with a lot of 
>terms this can add up to a lot of wasted RAM. But still I think we should wait 
>and fix this as part of flexible indexing, when we maybe refactor the 
>TermInfos to be "column stride" instead.

I am more than happy with the results, no need to squeeze the last bit out of 
it right now.
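
To put a rough number on it, here is a back-of-the-envelope sketch for the 
index tested above (2.5 million unique terms). It assumes every in-RAM term 
entry belongs to a field with TF omitted, and it shows two readings of 
"indexed term": every unique term, or only every 128th term. The 128 is an 
assumption about the term index interval in use; the class name is 
hypothetical.

```java
/** Hypothetical estimate of RAM spent on unused 8-byte proxPointers. */
class ProxPointerWaste {

    static final int PROX_POINTER_BYTES = 8; // one long per term entry

    /** Number of term entries held in RAM if only every interval-th term is kept. */
    static long indexEntries(long uniqueTerms, int interval) {
        return (uniqueTerms + interval - 1) / interval; // round up
    }

    /** Bytes wasted when each of the given in-RAM entries carries an unused pointer. */
    static long wastedBytes(long entriesInRam) {
        return entriesInRam * PROX_POINTER_BYTES;
    }

    public static void main(String[] args) {
        long uniqueTerms = 2_500_000L; // from the test index in this comment
        int assumedInterval = 128;     // assumption, not stated in the comment

        long upperBound = wastedBytes(uniqueTerms);
        long sampled = wastedBytes(indexEntries(uniqueTerms, assumedInterval));

        System.out.println("every term in RAM:      " + upperBound + " bytes");
        System.out.println("every 128th term in RAM: " + sampled + " bytes");
    }
}
```

Either way the amount is small for this index, which fits with not chasing it 
right now.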

Mike, thanks again for the great work! 



> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, 
> LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields; some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolean fields 
> possible in Lucene. This topic has already been discussed and accepted as a 
> part of Flexible Indexing... This issue tries to push things forward a bit 
> faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters: 
> enumerations, user rights, IDs or very short "texts", phone numbers, zip 
> codes, names...
> Status: just passed standard test (compatibility), committed for early review, 
> I have not tried the new feature, missing some asserts and one or two unit tests
> Complexity: simpler than expected
> Can be used via omitTf() (whoever has used omitNorms() will know where to find it :)  
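
The "one VInt less and one X>>>1" saving mentioned in the description can be 
illustrated with a small stand-alone sketch. It mimics, in simplified form, 
how doc deltas and frequencies are interleaved in a postings list: with TF, 
the delta is shifted left and the low bit flags "freq == 1", otherwise the 
freq follows as a second VInt; without TF, only the delta is written. This is 
an illustration under those assumptions, not the code from the patch, and all 
names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Simplified doc/freq postings encoding, with and without term frequencies. */
class PostingsSketch {

    /** Writes a non-negative int as a VInt: 7 data bits per byte, high bit = more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads one VInt starting at pos[0] and advances pos[0] past it. */
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes ascending doc IDs; freqs are ignored when omitTf is true. */
    static byte[] encode(int[] docs, int[] freqs, boolean omitTf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (omitTf) {
                writeVInt(out, delta);            // doc delta only
            } else if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1); // low bit set: freq is 1
            } else {
                writeVInt(out, delta << 1);       // low bit clear: freq follows
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    /** Decodes count doc IDs, skipping over any stored frequencies. */
    static int[] decodeDocs(byte[] buf, int count, boolean omitTf) {
        int[] docs = new int[count];
        int[] pos = {0};
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int code = readVInt(buf, pos);
            if (omitTf) {
                doc += code;                      // no shift, no second read
            } else {
                doc += code >>> 1;                // the "X>>>1"
                if ((code & 1) == 0) {
                    readVInt(buf, pos);           // the extra VInt (freq), unused here
                }
            }
            docs[i] = doc;
        }
        return docs;
    }

    public static void main(String[] args) {
        int[] docs = {3, 70, 200};
        int[] freqs = {1, 4, 1};
        System.out.println("bytes with TF:    " + encode(docs, freqs, false).length);
        System.out.println("bytes without TF: " + encode(docs, freqs, true).length);
    }
}
```

Positions are left out of the sketch entirely; in a real index they are a 
separate stream that a TF-less field would not need either.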

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


