[
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617140#action_12617140
]
Eks Dev commented on LUCENE-1340:
---------------------------------
We finished our tests.
Index without omitTf():
- 87 million documents, 2 indexed fields, 1 stored field
- Unique terms in index: 2.5 million
- Average field lengths in tokens: 3.3 and 5.5 (very short fields)
- On-disk size: 3.8 GB total, including the stored field
Queries under test:
- BooleanQuery in all shapes and forms (disjunctive, conjunctive, nested, with
minNumberShouldMatch()), with many clauses (5-100).
- Filters were used as well.
Test scope: regression with 30k queries on the same index, built once with
omitTf(true) and once with omitTf(false).
Result:
- The queries returned 100% identical hits (full recall tested, all hits
checked)!
- Index size reduction (not including the stored field!): 7% (short documents =>
fewer positions than in Mike's case)
- Query performance: 5.2% faster, but the index was loaded as RAMIndex (an
on-disk setup should gain even more due to the reduced IO for reading postings)
- Indexing performance (FSDisk!): 13% faster
Also, we compared omitTf(false) with this patch against lucene.jar without this
patch: no changes whatsoever.
From my perspective, this is good to go into production. At least for our
usage of Lucene, there are no differences with omitTf(true)...
> One more thing here: since the tiis are loaded into RAM, that unused
> proxPointer wastes 8 bytes for each indexed term. For indices with a lot of
> terms this can add up to a lot of wasted RAM. But still I think we should wait
> and fix this as part of flexible indexing, when we maybe refactor the
> TermInfos to be "column stride" instead.
I am more than happy with the results, no need to squeeze the last bit out of
it right now.
Mike, thanks again for the great work!
> Make it posible not to include TF information in index
> ------------------------------------------------------
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Eks Dev
> Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch,
> LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Term Frequency is typically not needed for all fields; some CPU (reading one
> VInt less and one X>>>1...) and IO can be spared by making pure boolean fields
> possible in Lucene. This topic has already been discussed and accepted as a
> part of Flexible Indexing... This issue tries to push things forward a bit
> faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters:
> enumerations, user rights, IDs or very short "texts", phone numbers, zip
> codes, names...
> Status: just passed the standard tests (compatibility), committed for early
> review; I have not tried the new feature, and some asserts and one or two unit
> tests are still missing
> Complexity: simpler than expected
> can be used via omitTf() (whoever used omitNorms() will know where to find it :)
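
The "one VInt less and one X>>>1" saving mentioned in the description can be illustrated with a small standalone sketch. It assumes the usual postings layout, where the doc delta is shifted left by one bit and the low bit flags a frequency of 1, with an extra VInt for any other frequency. With TF omitted, only the plain doc delta is written. The class and method names below (PostingsSketch, encode, decodeDocs) are hypothetical and are not taken from the patch; this is not the actual Lucene code.

```java
import java.io.ByteArrayOutputStream;

public class PostingsSketch {
    // Variable-length int: 7 payload bits per byte, high bit means "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // pos is a one-element array used as a mutable read cursor.
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // docs must be ascending; freqs[i] is the term frequency in docs[i].
    static byte[] encode(int[] docs, int[] freqs, boolean omitTf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (omitTf) {
                writeVInt(out, delta);             // doc delta only
            } else if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);  // low bit set: freq is 1
            } else {
                writeVInt(out, delta << 1);        // low bit clear: freq follows
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    // Recovers the doc IDs; frequencies are read and discarded.
    static int[] decodeDocs(byte[] buf, int count, boolean omitTf) {
        int[] docs = new int[count];
        int[] pos = {0};
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int code = readVInt(buf, pos);
            if (omitTf) {
                doc += code;                       // no shift, no second VInt
            } else {
                doc += code >>> 1;                 // the "X>>>1"
                if ((code & 1) == 0) {
                    readVInt(buf, pos);            // the extra VInt for freq
                }
            }
            docs[i] = doc;
        }
        return docs;
    }
}
```

In this sketch the omitTf path skips the shift and the second VInt, and it also leaves one more bit per delta before a VInt spills into another byte. That would fit the small but measurable gains reported above for very short fields, although the sketch says nothing about positions, which are not modelled here.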