[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom Burton-West updated LUCENE-2393:
------------------------------------

    Attachment: LUCENE-2393.patch

Rewrote argument processing so that the default behavior is that of HighFreqTerms. The field and the number of terms are now both optional, defaulting to all fields and 100 terms (the same default as the current HighFreqTerms). If the -t flag is used, the totalTermFreq stats are read, calculated, and displayed.

The bug surfaced when no field was specified. Added test data with multiple fields, plus tests that check that correct results are returned both with and without a field specified. The bug is fixed and the new tests pass.

With the increasing number of options, I started thinking about more robust command-line argument processing. I'm used to languages where there is a commonly used Getopt(s) library. There appear to be several for Java, with different features, different levels of active development, and different licenses. Is it worth the overhead of using one, and if so, which one would be the best to use?

Tom

> Utility to output total term frequency and df from a lucene index
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2393
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Tom Burton-West
>            Priority: Trivial
>         Attachments: LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch,
> LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch
>
>
> This is a pair of command line utilities that provide information on the
> total number of occurrences of a term in a Lucene index. The first takes a
> field name, term, and index directory and outputs the document frequency for
> the term and the total number of occurrences of the term in the index (i.e.
> the sum of the tf of the term for each document). The second reads the
> index to determine the top N most frequent terms (by document frequency) and
> then outputs a list of those terms along with the document frequency and the
> total number of occurrences of the term. Both utilities are useful for
> estimating the size of the term's entry in the *prx files and consequent Disk
> I/O demands.
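
For reference, the argument handling described in the comment above (optional field, optional number of terms defaulting to 100, and a -t flag) could be sketched by hand roughly as follows. This is only an illustration and not the code in the attached patch: the class name TermStatsArgs, the argument order (index directory first), and the rule that an all-digit argument is the term count are assumptions made for the example.

```java
// Hypothetical sketch of hand-rolled argument parsing; not taken from the patch.
// Assumed usage: <indexDir> [-t] [numTerms] [field]
class TermStatsArgs {
    public static final int DEFAULT_NUM_TERMS = 100;

    public final String indexDir;
    public final String field;          // null means "all fields"
    public final int numTerms;
    public final boolean totalTermFreq; // true when -t was given

    TermStatsArgs(String indexDir, String field, int numTerms, boolean totalTermFreq) {
        this.indexDir = indexDir;
        this.field = field;
        this.numTerms = numTerms;
        this.totalTermFreq = totalTermFreq;
    }

    public static TermStatsArgs parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("usage: <indexDir> [-t] [numTerms] [field]");
        }
        String indexDir = args[0];
        String field = null;
        int numTerms = DEFAULT_NUM_TERMS;
        boolean totalTermFreq = false;

        // The optional arguments may appear in any order after the index directory.
        for (int i = 1; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-t")) {
                totalTermFreq = true;
            } else if (arg.matches("\\d+")) {
                // Assumption: an all-digit argument is the term count,
                // so a field whose name is purely numeric cannot be selected.
                numTerms = Integer.parseInt(arg);
            } else {
                field = arg;
            }
        }
        return new TermStatsArgs(indexDir, field, numTerms, totalTermFreq);
    }

    public static void main(String[] args) {
        TermStatsArgs parsed = parse(args);
        System.out.println("index=" + parsed.indexDir
                + " field=" + (parsed.field == null ? "(all)" : parsed.field)
                + " numTerms=" + parsed.numTerms
                + " totalTermFreq=" + parsed.totalTermFreq);
    }
}
```

Even with only three optional arguments the parser has to guess whether a bare argument is a count or a field name, which is the kind of ambiguity a Getopt-style library with named options would remove.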