Grant Ingersoll wrote:
Also, perhaps we should spin off another thread to discuss how to make DocsWriter easier to maintain. My biggest concern is understanding how the various threads work together, plus a few other areas, but like I said, let's spin up a separate thread to brainstorm what is needed.
I agree we should work on simplifying it over time, and on spreading the knowledge of how it works.
Note that there is some risk in just using Wikipedia for profiling, given its distribution of terms, etc.
Good point. Previously I was using Europarl, but that corpus is just too fast to index.
Are you thinking Wikipedia is somewhat "dirty" (lots of extra terms not normally seen with clean content)? Since I'm using StandardAnalyzer and not an analyzer based on the new WikipediaTokenizer, I'm getting even more extra terms. Also, I think we'd need an HTMLFilter in the chain since Wikipedia content uses HTML markup. Grant, what analyzer chain do you use when you index Wikipedia?
I also wonder if using the LineDocMaker is all that realistic a profiling scenario. While it is really useful in that it minimizes I/O interaction, etc., I can't help but feel that it isn't at all close to typical usage. Most users are not going to have all their docs rolled up into a single file, 1 doc per line, so I wonder if we potentially lose insight into how Lucene performs, given that other issues like the I/O and memory used for loading files may force the JVM/Lucene to not have the resources it needs. Of course, I do know it is good to try to isolate things so we can focus just on Lucene, but we should also try to make some accounting for how it lives in the wild.
I agree this part is not realistic; the intention is to measure just the indexing time. In fact I expect most apps spend quite a bit more time building up a Document (filtering binary docs, etc.) than actually indexing it. The only real-world app I can think of that would be close to LineDocMaker is using Lucene to search big log files, where one line = one Document.
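To make the log-file case concrete, here is a minimal sketch in plain Java with no Lucene calls at all. LinePerDocSketch, LineDoc and feed are hypothetical names made up for illustration, and the Consumer is only a stand-in for whatever actually indexes the document; this is not how LineDocMaker itself is implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class LinePerDocSketch {

    /** Hypothetical stand-in for a document: the raw line is the whole body. */
    public static final class LineDoc {
        public final int ordinal;   // 1-based position among non-blank lines
        public final String body;

        LineDoc(int ordinal, String body) {
            this.ordinal = ordinal;
            this.body = body;
        }
    }

    /**
     * Reads the source line by line and hands one LineDoc per non-blank line
     * to the indexer (an assumption: a real app would add it to an index here).
     * Returns the number of documents fed.
     */
    public static int feed(Reader source, Consumer<LineDoc> indexer) throws IOException {
        int count = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // blank lines carry no document
                }
                count++;
                indexer.accept(new LineDoc(count, line));
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        List<LineDoc> indexed = new ArrayList<>();
        String log = "GET /index.html 200\n\nGET /missing 404\n";

        long start = System.nanoTime();
        int n = feed(new StringReader(log), indexed::add);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        System.out.println(n + " docs fed in " + elapsedMicros + " us");
    }
}
```

Since building each document is nearly free here, almost all of the measured time would fall in the indexer, which is the isolation the benchmark is after.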
Last, I think it would be good to always attach/check in the .alg file that is used when running the test, so that others can verify on different systems/configurations, etc.
I did post the alg (under LUCENE-1172), though I see I forgot to wrap it in {code} and it looks messed up now. My recent tests trying a single quickSort(Object[]) used the same alg, just repeated 10 times instead of 3.
But I agree we should always post the alg for all tests...

Mike