[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

Tom Burton-West (JIRA) Wed, 12 May 2010 15:18:07 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tom Burton-West updated LUCENE-2393:
------------------------------------

    Attachment: LUCENE-2393.patch

Rewrote argument processing so the default behavior is that of HighFreqTerms.  
The field and number of terms are now both optional, with the defaults being all 
fields and 100 terms (the same default as the current HighFreqTerms).  If the -t 
flag is used, the totalTermFreq stats will be read, calculated, and displayed.
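
As a rough illustration of the defaults described above, here is a minimal 
sketch of that kind of argument handling.  It is not the code in the attached 
patch: the class name HighFreqArgs, the argument order, and the rule that a 
numeric argument is the term count while anything else is the field name are 
all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for parsed options; not the class in the attached patch. */
class HighFreqArgs {
    public static final int DEFAULT_NUM_TERMS = 100;

    public final String indexDir;
    public final String field;          // null means "all fields"
    public final int numTerms;
    public final boolean totalTermFreq; // true when -t was given

    private HighFreqArgs(String indexDir, String field, int numTerms, boolean totalTermFreq) {
        this.indexDir = indexDir;
        this.field = field;
        this.numTerms = numTerms;
        this.totalTermFreq = totalTermFreq;
    }

    /** Assumed usage: <index dir> [-t] [number of terms] [field]. */
    public static HighFreqArgs parse(String[] args) {
        boolean totalTermFreq = false;
        List<String> positional = new ArrayList<>();
        for (String arg : args) {
            if ("-t".equals(arg)) {
                totalTermFreq = true;
            } else {
                positional.add(arg);
            }
        }
        if (positional.isEmpty() || positional.size() > 3) {
            throw new IllegalArgumentException(
                "usage: <index dir> [-t] [number of terms] [field]");
        }
        String field = null;                 // default: all fields
        int numTerms = DEFAULT_NUM_TERMS;    // default: 100 terms
        for (String value : positional.subList(1, positional.size())) {
            try {
                numTerms = Integer.parseInt(value);
            } catch (NumberFormatException e) {
                field = value;               // non-numeric argument is taken as the field
            }
        }
        return new HighFreqArgs(positional.get(0), field, numTerms, totalTermFreq);
    }
}
```

One limitation of this approach is that a field whose name is purely numeric 
cannot be told apart from a term count, which is the kind of ambiguity a 
getopt-style library with named options would avoid.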

The bug surfaced when no field was specified.  Added test data with multiple 
fields and tests to check that correct results are returned with and without a 
field being specified.  Fixed the bug, and the new tests pass.

With the increasing number of options, I started thinking about more robust 
command line argument processing.  I'm used to languages where there is a 
commonly used Getopt(s) library.  There appear to be several for Java, with 
different features, different levels of active development, and different 
licenses.  Is it worth the overhead of using one, and if so, which one would be 
the best to use?

Tom


> Utility to output total term frequency and df from a lucene index
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2393
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Tom Burton-West
>            Priority: Trivial
>         Attachments: LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch, 
> LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch
>
>
> This is a pair of command line utilities that provide information on the 
> total number of occurrences of a term in a Lucene index.  The first takes a 
> field name, term, and index directory and outputs the document frequency for 
> the term and the total number of occurrences of the term in the index (i.e., 
> the sum of the tf of the term for each document).  The second reads the 
> index to determine the top N most frequent terms (by document frequency) and 
> then outputs a list of those terms along with the document frequency and the 
> total number of occurrences of each term.  Both utilities are useful for 
> estimating the size of a term's entry in the *prx files and the consequent 
> disk I/O demands.
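
To make the two statistics in the description concrete, here is a small sketch 
that computes them from a toy in-memory postings map.  It deliberately uses no 
Lucene API; the TermStats class and the map from document id to within-document 
term frequency are assumptions made for the example, not how the utilities 
read the index.

```java
import java.util.Map;

/** Hypothetical value class for one term's statistics. */
class TermStats {
    public final int docFreq;         // number of documents containing the term
    public final long totalTermFreq;  // sum of tf over those documents

    private TermStats(int docFreq, long totalTermFreq) {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }

    /** tfByDoc maps a document id to the term's frequency in that document. */
    public static TermStats fromPostings(Map<Integer, Integer> tfByDoc) {
        int df = 0;
        long ttf = 0;
        for (int tf : tfByDoc.values()) {
            if (tf > 0) {      // only documents that actually contain the term count
                df++;
                ttf += tf;
            }
        }
        return new TermStats(df, ttf);
    }
}
```

For a term that occurs 3 times in document 0, once in document 4, and twice in 
document 7, this gives a document frequency of 3 and a total term frequency of 
6.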

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
