Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

Michael McCandless Mon, 11 Feb 2008 11:30:36 -0800


Grant Ingersoll wrote:

Also, perhaps we should spin off another thread to discuss how to make DocsWriter easier to maintain. My biggest concern is understanding how the various threads work together, and a few other areas but, like I said, let's spin up a separate thread to brainstorm what is needed.

I agree we should work on simplifying it over time, and on spreading the knowledge of how it works.

Note that there is some risk in just using Wikipedia for profiling, given its distribution of terms, etc.

Good point. Previously I was using Europarl, but that corpus is just too fast to index.

Are you thinking Wikipedia is somewhat "dirty" (lots of extra terms not normally seen with clean content)? Since I'm using StandardAnalyzer and not an analyzer based on the new WikipediaTokenizer, I'm getting even more extra terms. Also, I think we'd need an HTMLFilter in the chain since Wikipedia content uses HTML markup. Grant, what analyzer chain do you use when you index Wikipedia?

I also wonder if using the LineDocMaker is all that realistic a profiling scenario. While it is really useful in that it minimizes IO interaction, etc., I can't help but feel that it isn't at all close to typical usage. Most users are not going to have all their docs rolled up into a single file, 1 doc per line, so I wonder if we potentially lose insight into how Lucene performs given that other issues, like the I/O and memory used for loading files, may force the JVM/Lucene to not have the resources it needs. Of course, I do know it is good to try to isolate things so we can focus just on Lucene, but we also should try to make some accounting for how it lives in the wild.

I agree, this part is not realistic, and the intention is to measure just the indexing time. In fact I expect most apps spend quite a bit more time building up a Document (filtering binary docs, etc.) than actually indexing it. The only real-world app that I can think of that would be close to LineDocMaker is using Lucene to search big log files, where one line = one Document.
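To make the "one line = one Document" case concrete, here is a minimal sketch in plain Java. It makes no Lucene calls and is not how LineDocMaker itself is implemented; LineDocs, Doc and readDocs are hypothetical names used only for illustration, and skipping empty lines is an assumption. The point is just that document construction is trivial in this model, so nearly all measured time would go to indexing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineDocs {
    /** Hypothetical stand-in for a document with a single body field. */
    public static final class Doc {
        public final String body;
        Doc(String body) { this.body = body; }
    }

    /** Builds one Doc per non-empty line; no parsing or filtering is done. */
    public static List<Doc> readDocs(BufferedReader in) throws IOException {
        List<Doc> docs = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.isEmpty()) {
                docs.add(new Doc(line));
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Two log lines become two documents.
        String log = "GET /index.html 200\nGET /missing 404\n";
        List<Doc> docs = readDocs(new BufferedReader(new StringReader(log)));
        System.out.println(docs.size() + " docs"); // prints "2 docs"
    }
}
```

In a real app each Doc would then be handed to the indexer; that step is left out here on purpose.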

Last, I think it would be good to always attach/check in the .alg file that is used when running the test, so that others can verify on different systems/configurations, etc.

I did post the alg (under LUCENE-1172), though I see I forgot to {code} it and it looks messed up now. My recent test to try a single quickSort(Object[]) used the same alg, just repeated 10 times instead of 3.


But I agree we should always post the alg for all tests...

Mike

