[jira] Updated: (LUCENE-1340) Make it possible not to include TF information in index

Michael McCandless (JIRA) Thu, 24 Jul 2008 09:02:26 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-1340:
---------------------------------------

    Attachment: LUCENE-1340.patch

I attached a new rev of the patch:

  * Use less RAM if field omits tf's (don't write the tf's into the RAM 
buffer), so we flush less often

  * Added another test case to TestOmitTf

As a test, I indexed the full Wikipedia (~3.2 million docs) with this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = false
doc.term.vector = false
doc.add.log.step=10000
max.field.length=2147483647

directory=FSDirectory
autocommit=false
compound=false
doc.maker.forever = false

work.dir=/lucene/work2
ram.flush.mb=64

- CreateIndex
{ "AddDocs" AddDoc > : *
- CloseIndex

RepSumByPrefRound AddDoc

{code}

With tf's, indexing takes 970 seconds and the index size is 2.5 GB.
Without tf's, it takes 834 seconds (14% faster) and the index size is
1.1 GB (56% smaller).
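
The percentages above can be checked with a few lines of Java. This is only a
sketch of the arithmetic: the class and method names are hypothetical, and
"faster" is read here as the reduction in elapsed indexing time relative to
the run with tf's.

```java
public class OmitTfSavings {
    // Percentage reduction going from a baseline value to a smaller value.
    public static double percentReduction(double baseline, double reduced) {
        return 100.0 * (baseline - reduced) / baseline;
    }

    public static void main(String[] args) {
        // Figures taken from the comment above: seconds, then GB.
        System.out.printf("time: %.0f%% less%n", percentReduction(970, 834));
        System.out.printf("size: %.0f%% less%n", percentReduction(2.5, 1.1));
    }
}
```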


> Make it possible not to include TF information in index
> -------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, 
> LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields; some CPU (reading one 
> VInt less and one X>>>1...) and IO can be spared by making pure boolean fields 
> possible in Lucene. This topic has already been discussed and accepted as 
> part of Flexible Indexing... This issue tries to push things forward a bit 
> faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters, 
> enumerations, user rights, IDs or very short "texts", phone numbers, zip 
> codes, names...
> Status: just passed the standard tests (compatibility), committed for early 
> review. I have not tried the new feature yet; it is missing some asserts and 
> one or two unit tests.
> Complexity: simpler than expected.
> Can be used via omitTf() (whoever has used omitNorms() will know where to find it :)
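
The "one VInt less and one X>>>1" remark in the description refers to how a
postings entry is laid out. The following is a rough, self-contained sketch
and not code from the patch: it assumes the documented frequency-file
encoding, where the doc delta is shifted left by one, the low bit flags
freq == 1, and any other freq follows as its own VInt. It also assumes that a
tf-omitting field writes the bare doc delta. The class and method names are
hypothetical.

```java
import java.io.ByteArrayOutputStream;

public class OmitTfSketch {
    // VInt: 7 payload bits per byte, high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // With tf's: (delta << 1) | 1 when freq == 1, else (delta << 1) then freq.
    public static byte[] encodeWithTf(int[] docIds, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - lastDoc;
            lastDoc = docIds[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    // Without tf's (assumed layout): only the doc delta, no shift, no freq.
    public static byte[] encodeWithoutTf(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int docId : docIds) {
            writeVInt(out, docId - lastDoc);
            lastDoc = docId;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 70, 200};
        int[] freqs = {1, 4, 1, 2};
        System.out.println("with tf's:    " + encodeWithTf(docIds, freqs).length + " bytes");
        System.out.println("without tf's: " + encodeWithoutTf(docIds).length + " bytes");
    }
}
```

For this toy postings list the tf-free form is shorter for two reasons: the
freq VInts disappear, and dropping the shift leaves one more bit per byte for
the doc delta. Real savings depend on the term and frequency distribution of
the indexed field.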

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


