[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1340:
---------------------------------------

    Attachment: LUCENE-1340.patch

OK good progress eks! I started from your latest patch and made some further changes:

  * Fixed DW to not consume RAM writing prx if omitTf==true

  * Fixed FreqProxTermsWriter to not create the *.prx file if all fields omit term freq. I added hasProx to SegmentInfo, and changed the index file format to store this new boolean.

  * Fixed FreqProxTermsWriterPerField to not write prox into the RAM buffer if we will omitTf on flushing the segment to disk. This makes the RAM buffer efficient (no bytes wasted on prox when omitTf==true for a field).

  * Added more test cases to TestOmitTf

  * Small whitespace, comment changes

The one place I know of that will still waste bytes is the term dict (TermInfo): it stores a long proxPointer on disk (in *.tii, *.tis) and also in memory, because we load *.tii into RAM. For fields with omitTf==true this will always be unused, and we could save a lot of disk/RAM if we didn't store it.

Unfortunately, I think it's too big a change to try to fix this now; I think we should wait until flex indexing is online. I wonder how we can solve it at that point: maybe we should change TermInfo to be "column stride", meaning there are separate arrays storing the values for all terms (i.e. long[] proxPointers, long[] freqPointers, etc.). This would also fit the "pluggable" model better, meaning any plugin can store new stuff (its own arrays) per term.
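To make the "column stride" idea a bit more concrete, here is a minimal sketch of a term table that keeps one parallel array per attribute and skips the proxPointers array entirely when no field has prox. The class name ColumnStrideTermInfos and all of its methods are hypothetical and exist only for illustration; this is not the layout in the patch or in Lucene itself.

```java
import java.util.Arrays;

/** Hypothetical column-stride term table: one parallel array per attribute. */
final class ColumnStrideTermInfos {
    private final boolean hasProx;
    private int size;
    private int[] docFreqs = new int[4];
    private long[] freqPointers = new long[4];
    private long[] proxPointers; // stays null when hasProx == false

    ColumnStrideTermInfos(boolean hasProx) {
        this.hasProx = hasProx;
        if (hasProx) {
            proxPointers = new long[4];
        }
    }

    /** Appends one term and returns its ordinal; proxPointer is ignored without prox. */
    int add(int docFreq, long freqPointer, long proxPointer) {
        if (size == docFreqs.length) {
            int newLength = size * 2;
            docFreqs = Arrays.copyOf(docFreqs, newLength);
            freqPointers = Arrays.copyOf(freqPointers, newLength);
            if (hasProx) {
                proxPointers = Arrays.copyOf(proxPointers, newLength);
            }
        }
        docFreqs[size] = docFreq;
        freqPointers[size] = freqPointer;
        if (hasProx) {
            proxPointers[size] = proxPointer;
        }
        return size++;
    }

    int size() { return size; }

    int docFreq(int ord) { return docFreqs[check(ord)]; }

    long freqPointer(int ord) { return freqPointers[check(ord)]; }

    long proxPointer(int ord) {
        if (!hasProx) {
            throw new IllegalStateException("no prox data stored for these terms");
        }
        return proxPointers[check(ord)];
    }

    /** Payload bytes held per term, not counting array headers or slack capacity. */
    int bytesPerTerm() {
        return Integer.BYTES + Long.BYTES + (hasProx ? Long.BYTES : 0);
    }

    private int check(int ord) {
        if (ord < 0 || ord >= size) {
            throw new IndexOutOfBoundsException("ord=" + ord + " size=" + size);
        }
        return ord;
    }
}
```

With this shape, a plugin that wants its own per-term value would add another array alongside the existing ones, and a segment with no prox simply never allocates the prox column.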
> Make it possible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch,
>                      LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields. Some CPU (reading one
> VInt less and one X>>>1...) and IO can be spared by making pure boolean fields
> possible in Lucene. This topic has already been discussed and accepted as a
> part of Flexible Indexing... This issue tries to push things forward a bit
> faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters:
> enumerations, user rights, IDs or very short "texts", phone numbers, zip
> codes, names...
> Status: just passed the standard tests (compatibility), committed for early
> review. I have not tried the new feature; it is missing some asserts and one
> or two unit tests.
> Complexity: simpler than expected
> Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
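The "one VInt less and one X>>>1" saving mentioned in the description can be illustrated with a small, self-contained sketch of how a doc/freq stream might be encoded with and without term frequencies. It assumes the classic scheme in which the doc delta is shifted left by one and the low bit flags freq == 1; the class FreqStreamSketch and its methods are hypothetical and are not the code in the patch.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of doc/freq encoding with and without term frequencies. */
final class FreqStreamSketch {

    /** Writes a non-negative int as a variable-length int, 7 bits per byte, low bits first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /**
     * Encodes ascending doc IDs for one term.
     * With omitTf, only the doc delta is written: no shift and no freq VInt.
     * Otherwise the delta is shifted left by one; a set low bit means freq == 1,
     * and a clear low bit means the freq follows as its own VInt.
     */
    static byte[] encode(int[] docs, int[] freqs, boolean omitTf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (omitTf) {
                writeVInt(out, delta);
            } else if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }
}
```

A reader of the omitTf stream takes each VInt as the doc delta directly, which is where the skipped shift and the skipped freq read come from.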