EnwikiDocMaker ?

Grant Ingersoll Wed, 09 Jan 2008 05:55:47 -0800

As one can probably guess, I have been looking at the EnwikiDocMaker a bit and using it outside of the benchmark suite, as related to the new contrib/wikipedia stuff. Just wanted to make sure I have a good basic understanding of what it is doing, because I am looking for ways to speed it up, so correct me if I am wrong, please:

The basic gist of it is, there is a background thread that gets kicked off by the first next() call and is responsible for parsing and loading the tuples one at a time, right? Thus, the main makeDocument() method waits until a tuple is available from this thread and then returns it once it is notified that one is available, right?
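
To make sure we are talking about the same thing, here is a minimal sketch of that hand-off as I understand it. This is not the actual EnwikiDocMaker code: the class name SingleSlotHandoff and its methods are made up for illustration, and it assumes a single slot guarded by wait/notify.

```java
// Hypothetical sketch of a one-tuple-at-a-time hand-off between a parser
// thread and makeDocument(). Names are illustrative, not the real code.
final class SingleSlotHandoff<T> {
    private T slot;            // holds at most one parsed tuple
    private boolean finished;  // set once the parser reaches end of input

    // Parser thread: blocks until the previous tuple has been taken.
    synchronized void put(T tuple) throws InterruptedException {
        while (slot != null) {
            wait();
        }
        slot = tuple;
        notifyAll(); // wake a consumer waiting in take()
    }

    // Parser thread: signals that no more tuples will arrive.
    synchronized void finish() {
        finished = true;
        notifyAll();
    }

    // Consumer (makeDocument): blocks until a tuple is available.
    // Returns null once the input is exhausted.
    synchronized T take() throws InterruptedException {
        while (slot == null && !finished) {
            wait();
        }
        T tuple = slot;
        slot = null;
        notifyAll(); // wake the parser waiting in put()
        return tuple;
    }
}
```

If that is roughly right, then every indexing thread serializes on the one slot, and the parser cannot get more than one tuple ahead of the consumers.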

As we've discussed in the past, the EnwikiDocMaker is a bottleneck in the benchmark when it comes to running multiple indexing threads. So, I was thinking of a couple of different options and wanted to get an opinion on what seems the most worthwhile to pursue:

1. Implement some sort of splitting version of the DocMaker that has multiple threads, each responsible for parsing a certain section of the file. This would require us to know the number of documents ahead of time, but that isn't a big deal, as one could either statically set it, or write a little utility that counts the docs. Thus, one could either hide this in the doc maker or construct multiple doc makers, each with their own range. Taking this a step further, the utility could output the file pointers where each range of documents starts, so that each thread could skip ahead to that point (possibly, not sure how that would work with an XML parser).

2. Implement some sort of tuple buffering, whereby the reading thread reads multiple documents at a time and buffers them, then makeDocument can consume the buffer and only has to wait/exit when the buffer is empty. The producer thread could just work to fill the buffer at all times unless it receives a quit message.

3. Split the large XML file into X smaller files and run them independently. Thus, if you have 4 threads, split the file into 4 files and treat them separately. This is an easier-to-get-right version of #1.
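
For what it's worth, option 2 could look something like the sketch below. It is only a rough idea under some assumptions: BufferedTupleSource is a made-up name, the parser is stood in for by a plain Iterator of String[] tuples, and the "quit message" is modeled as a thread interrupt. The only library piece is java.util.concurrent.ArrayBlockingQueue (so Java 5 or later).

```java
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of option 2: a producer thread keeps a bounded buffer
// of parsed tuples full, and consumers only block when the buffer is empty.
final class BufferedTupleSource {
    private static final String[] END = new String[0]; // end-of-input sentinel
    private final BlockingQueue<String[]> buffer;
    private final Thread producer;

    BufferedTupleSource(Iterator<String[]> parser, int capacity) {
        buffer = new ArrayBlockingQueue<>(capacity);
        producer = new Thread(() -> {
            try {
                while (parser.hasNext()) {
                    buffer.put(parser.next()); // blocks only while the buffer is full
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                // Interrupt is treated as the quit message: drop what is
                // buffered and post the sentinel so consumers stop waiting.
                buffer.clear();
                buffer.offer(END);
            }
        }, "tuple-producer");
        producer.setDaemon(true);
        producer.start();
    }

    // Called from makeDocument(); returns null once the input is exhausted.
    String[] next() throws InterruptedException {
        String[] tuple = buffer.take();
        if (tuple == END) {
            buffer.put(END); // re-post so other indexing threads also see the end
            return null;
        }
        return tuple;
    }

    // Ask the producer to stop early.
    void quit() {
        producer.interrupt();
    }
}
```

This keeps a single parser, so it only helps if the consumers are stalling on the hand-off rather than on the parsing itself. If the XML parsing is the real cost, options 1 or 3 would still be needed.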


Thoughts?

-Grant

