[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1340:
---------------------------------------
Attachment: LUCENE-1340.patch
I attached a new rev of the patch:
* Use less RAM if field omits tf's (don't write the tf's into the RAM
buffer), so we flush less often
* Added another test case to TestOmitTf
As a test, I indexed the full Wikipedia (~3.2 million docs) with this alg:
{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = false
doc.term.vector = false
doc.add.log.step=10000
max.field.length=2147483647
directory=FSDirectory
autocommit=false
compound=false
doc.maker.forever = false
work.dir=/lucene/work2
ram.flush.mb=64
- CreateIndex
{ "AddDocs" AddDoc > : *
- CloseIndex
RepSumByPrefRound AddDoc
{code}
With tf's it takes 970 seconds and index size is 2.5 GB. Without tf's
it takes 834 seconds (14% faster) and index size is 1.1 GB (56%
smaller).
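
For anyone who wants to re-check the percentages, here is a minimal sketch of the arithmetic. The class and method names are made up for illustration; only the four figures (970 s, 834 s, 2.5 GB, 1.1 GB) come from the run above. Note that both percentages are reductions relative to the with-tf baseline.

```java
public class OmitTfSavings {
    /** Percentage reduction going from `before` to `after`, relative to `before`. */
    static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Figures reported above for the full-Wikipedia run.
        System.out.printf("indexing time: %.0f%% less%n", percentReduction(970, 834));
        System.out.printf("index size:    %.0f%% smaller%n", percentReduction(2.5, 1.1));
    }
}
```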
> Make it possible not to include TF information in index
> -------------------------------------------------------
>
> Key: LUCENE-1340
> URL: https://issues.apache.org/jira/browse/LUCENE-1340
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Eks Dev
> Priority: Minor
> Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch,
> LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields; some CPU (reading one
> VInt less and one X>>>1...) and IO can be spared by making pure boolean fields
> possible in Lucene. This topic has already been discussed and accepted as
> part of Flexible Indexing... This issue tries to push things forward a bit
> faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters,
> enumerations, user rights, IDs or very short "texts", phone numbers, zip
> codes, names...
> Status: just passed the standard tests (compatibility), committed for early
> review. I have not tried the new feature; some asserts and one or two unit
> tests are missing.
> Complexity: simpler than expected
> Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
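
To illustrate the "one VInt less and one X>>>1" remark in the description, here is a rough, self-contained sketch of the two posting encodings. It assumes the layout described in the Lucene index file format documentation for the .frq file: with tf, the doc delta is shifted left by one, the low bit flags freq == 1, and any other freq follows as its own VInt. The omit-tf layout shown (plain doc delta, nothing else) is what the description implies, not something taken from the patch. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

public class PostingsSketch {
    /** Writes a non-negative int as a VInt: 7 bits per byte, high bit set while more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    /** With tf: (docDelta << 1), low bit set when freq == 1, otherwise freq follows as a VInt. */
    static byte[] encodeWithTf(int[] docs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int n = 0; n < docs.length; n++) {
            int delta = docs[n] - last;
            last = docs[n];
            if (freqs[n] == 1) {
                writeVInt(out, (delta << 1) | 1);
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[n]);
            }
        }
        return out.toByteArray();
    }

    /** Without tf (assumed layout): the doc delta is written as is, no shift, no freq. */
    static byte[] encodeWithoutTf(int[] docs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int doc : docs) {
            writeVInt(out, doc - last);
            last = doc;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] docs = {3, 70, 200, 1000};   // ascending doc ids for one term
        int[] freqs = {1, 4, 1, 2};        // term frequency in each doc
        System.out.println("with tf:    " + encodeWithTf(docs, freqs).length + " bytes");
        System.out.println("without tf: " + encodeWithoutTf(docs).length + " bytes");
    }
}
```

A reader of the omit-tf form skips both the `>>>1` on every doc delta and the occasional extra VInt read, and small deltas get one more usable bit before spilling into a second byte.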