[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617140#action_12617140 ]
Eks Dev commented on LUCENE-1340:
---------------------------------

We finished our tests.

Index without omitTf():
- 87 million documents, 2 indexed fields, one stored field
- Unique terms in index: 2.5 million
- Average field lengths in tokens: 3.3 and 5.5 (very short fields)
- On-disk size: 3.8 GB total, including the stored field

Queries under test:
- BooleanQuery in all shapes and forms (disjunctive, conjunctive, nested, with minNumberShouldMatch()), with a lot of clauses (5-100)
- Filter used: yes

Test scope: regression with 30k queries on the same index with omitTf(true/false).

Result:
- The queries returned 100% identical hits (full recall tested, all hits checked)!
- Index size reduction (not including the stored field!): 7% (short documents => fewer positions than in Mike's case)
- Query performance: 5.2% faster, but the index was loaded as RAMIndex (an on-disk setup should gain even more due to the reduced IO for reading postings)
- Indexing performance (FSDisk!): 13% faster

Also, we compared omitTf(false) with this patch against lucene.jar without this patch: no changes whatsoever.

From my perspective, this is good to go into production. At least for our usage of Lucene, there are no differences with omitTf(true)...

> One more thing here: since the tiis are loaded into RAM, that unused
> proxPointer wastes 8 bytes for each indexed term. For indices with a lot of
> terms this can add up to a lot of wasted RAM. But still I think we should wait
> and fix this as part of flexible indexing, when we maybe refactor the
> TermInfos to be "column stride" instead.

I am more than happy with the results, no need to squeeze the last bit out of it right now.

Mike, thanks again for the great work!

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch,
>                      LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields. Some CPU (reading one
> VInt less and one X>>>1...) and IO can be spared by making pure boolean fields
> possible in Lucene. This topic has already been discussed and accepted as a
> part of Flexible Indexing... This issue tries to push things forward a bit
> faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters:
> enumerations, user rights, IDs or very short "texts", phone numbers, zip
> codes, names...
> Status: just passed the standard tests (compatibility), committed for early
> review. I have not tried the new feature; some asserts and one or two unit
> tests are still missing.
> Complexity: simpler than expected
> Can be used via omitTf() (whoever has used omitNorms() will know where to find it :)
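
To make the saving mentioned in the issue description concrete ("reading one VInt less and one X>>>1"), here is a minimal sketch of the doc/freq posting encoding with and without term frequencies. It is not code from the patch: the class PostingsSketch and its methods are hypothetical names made up for illustration, and skip data and the positions file are left out entirely. It assumes the layout where, with frequencies, the doc delta is shifted left by one and the low bit flags freq == 1 (otherwise the freq follows as its own VInt), while with omitTf the plain doc delta is written.

```java
import java.io.ByteArrayOutputStream;

public class PostingsSketch {

    // Writes a non-negative int as a VInt: 7 payload bits per byte,
    // high bit set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads one VInt from buf at pos[0] and advances pos[0] past it.
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Encodes sorted doc ids (and their term frequencies) for one term.
    static byte[] encode(int[] docs, int[] freqs, boolean omitTf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (omitTf) {
                writeVInt(out, delta);            // plain delta, no freq at all
            } else if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1); // low bit set: freq is 1
            } else {
                writeVInt(out, delta << 1);       // low bit clear: freq follows
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    // Decodes only the doc ids; with frequencies this needs the extra
    // shift and, for freq != 1, one more VInt read per posting.
    static int[] decodeDocs(byte[] buf, int count, boolean omitTf) {
        int[] docs = new int[count];
        int[] pos = {0};
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int code = readVInt(buf, pos);
            if (omitTf) {
                doc += code;
            } else {
                doc += code >>> 1;
                if ((code & 1) == 0) {
                    readVInt(buf, pos);           // read and discard the freq
                }
            }
            docs[i] = doc;
        }
        return docs;
    }

    public static void main(String[] args) {
        int[] docs = {3, 70, 200, 1000};          // made-up sample postings
        int[] freqs = {1, 2, 1, 5};
        System.out.println("with tf: " + encode(docs, freqs, false).length + " bytes");
        System.out.println("omitTf:  " + encode(docs, freqs, true).length + " bytes");
    }
}
```

On the made-up postings in main, the omitTf variant comes out smaller for two reasons: no separate freq VInt is written, and each delta keeps one more bit before it spills into a second byte.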