On Sun, 2 Nov 2008 21:06:56 -0200 Rafael Cunha de Almeida <[EMAIL PROTECTED]> wrote:
> On Sun, 2 Nov 2008 07:11:20 -0500
> Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >
> > On Nov 1, 2008, at 1:39 AM, Rafael Cunha de Almeida wrote:
> >
> > > Hello,
> > >
> > > I wrote an indexer that parses some files and indexes them using
> > > Lucene. I want to benchmark the whole thing, so I'd like to count
> > > the tokens being indexed so I can calculate the average number of
> > > indexed tokens per second. Is there a way to count the number of
> > > tokens in a document?
> >
> > I think you would have to add a "CountingTokenFilter" that you write
> > and manage as you add documents. Or, you could just take the total
> > number of tokens divided by the number of docs and use the average.
> > That can be obtained without writing a new TokenFilter.
>
> How would I obtain the total number of tokens in an index? I couldn't
> find that statistic anywhere. I looked for it in the IndexWriter,
> IndexReader and IndexSearcher classes. Is there maybe some tool I
> could run on an index, or something like that?

I'm using PerFieldAnalyzerWrapper, so I tried writing the following
Analyzer to count the tokens:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import java.io.Reader;

public class CounterAnalyzer extends PerFieldAnalyzerWrapper {
    public CounterAnalyzer(Analyzer a) {
        super(a);
    }

    // Wrap whatever stream the wrapped analyzer produces, so every
    // token passes through the CounterFilter below.
    public TokenStream tokenStream(String field, Reader reader) {
        return new CounterFilter(super.tokenStream(field, reader));
    }
}

The CounterFilter is implemented as:

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import java.io.IOException;

public final class CounterFilter extends TokenFilter {
    public CounterFilter(TokenStream in) {
        super(in);
    }

    // Print a "1" for every token that comes through the filter.
    public Token next(Token t) throws IOException {
        assert t != null;
        Token nt = input.next(t);
        if (nt == null)
            return null;
        System.out.println("1");
        return nt;
    }
}

My idea was to pipe the output to an awk script that counts the number
of 1s. But, to my surprise, the tokenStream method of the analyzer was
never even called during indexing. Could someone tell me how I should
count the number of tokens in an index?
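In case it helps to see what I'm aiming for: instead of printing "1"s
and piping them through awk, the tally could just as well live inside
the filter. This is only a rough sketch of that variation; the shared
counter and the getTokenCount() accessor are my own additions, not
anything from the Lucene API.

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a counting filter: keeps the total in memory instead of
// printing one line per token.
public final class CountingTokenFilter extends TokenFilter {
    // Shared across all filter instances, since the analyzer creates a
    // new stream (and therefore a new filter) per field per document.
    private static final AtomicLong count = new AtomicLong();

    public CountingTokenFilter(TokenStream in) {
        super(in);
    }

    public Token next(Token t) throws IOException {
        Token nt = input.next(t);
        if (nt != null)
            count.incrementAndGet();   // one more token was produced
        return nt;
    }

    public static long getTokenCount() {
        return count.get();
    }
}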
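As for getting the total out of the index itself after the fact, the
closest thing I could come up with is walking every posting with
TermEnum/TermDocs and summing the frequencies, though I'm not sure
that's what Grant meant. A rough, untested sketch follows (TokenCount
is just a throwaway class name, and it only counts what actually made
it into the index, i.e. after stop words and the like are stripped):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Sketch: sum the within-document frequency of every term in every
// field, which should add up to the number of indexed token occurrences.
public class TokenCount {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.getDirectory(args[0]));
        long total = 0;
        TermEnum terms = reader.terms();   // enumerates every term in the index
        TermDocs docs = reader.termDocs();
        while (terms.next()) {
            docs.seek(terms.term());       // postings for the current term
            while (docs.next())
                total += docs.freq();      // occurrences in this document
        }
        int numDocs = reader.numDocs();
        docs.close();
        terms.close();
        reader.close();
        System.out.println("total tokens: " + total);
        System.out.println("average per doc: " + (double) total / numDocs);
    }
}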