As one can probably guess, I have been looking at the EnwikiDocMaker a
bit and using it outside of the benchmark suite, as related to the new
contrib/wikipedia stuff. Just wanted to make sure I have a good
basic understanding of what it is doing, because I am looking for ways
to speed it up, so correct me if I am wrong, please:
The basic gist is that a background thread, kicked off by the first
next() call, is responsible for parsing and loading the tuples one at
a time, right? The main makeDocument() method then waits until this
thread notifies it that a tuple is available and returns that tuple.
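
To make the handoff described above concrete, here is a minimal sketch
of a single-slot wait/notify exchange between a parser thread and a
consumer. It is only an illustration of my understanding, not the
actual EnwikiDocMaker code: the class name SingleSlotHandoff, the
String[] tuple type, and the Iterator standing in for the XML parser
are all assumptions.

```java
import java.util.Iterator;

/** Hypothetical single-slot handoff; NOT the actual EnwikiDocMaker code. */
class SingleSlotHandoff {
    private final Iterator<String[]> source; // stands in for the XML parser
    private String[] slot;                   // holds at most one parsed tuple
    private boolean done;                    // set once the source is exhausted
    private Thread parser;                   // started lazily by the first next()

    SingleSlotHandoff(Iterator<String[]> source) {
        this.source = source;
    }

    /** Returns the next tuple, or null when the input is exhausted or the caller is interrupted. */
    synchronized String[] next() {
        if (parser == null) {
            parser = new Thread(this::parseLoop, "parser");
            parser.setDaemon(true);
            parser.start();
        }
        try {
            while (slot == null && !done) {
                wait(); // block until the parser publishes a tuple or finishes
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        String[] result = slot;
        slot = null;
        notifyAll(); // wake the parser so it can fill the slot again
        return result;
    }

    private void parseLoop() {
        try {
            while (source.hasNext()) {
                String[] tuple = source.next(); // "parsing" happens outside the lock
                synchronized (this) {
                    while (slot != null) {
                        wait(); // slot still full: the consumer has not picked it up yet
                    }
                    slot = tuple;
                    notifyAll();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            synchronized (this) {
                done = true; // lets waiting consumers return null
                notifyAll();
            }
        }
    }
}
```

With only one slot, the parser can never get more than one tuple ahead
of the consumers, which is consistent with it becoming the bottleneck
once several indexing threads call next() concurrently.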
As we've discussed in the past, the EnwikiDocMaker is a bottleneck in
the benchmark when it comes to running multiple indexing threads. So,
I was thinking of a couple of different options and wanted to get an
opinion on what seems the most worthwhile to pursue:
1. Implement some sort of splitting version of the DocMaker that has
multiple threads, each responsible for parsing a certain section of
the file. This would require us to know the number of documents ahead
of time, but that isn't a big deal, as one could either statically set
it, or write a little utility that counts the docs. Thus, one could
either hide this in the doc maker or construct multiple doc makers,
each with their own range. Taking this a step further, the utility
could output the file pointers where each range of documents starts,
so that each thread could skip ahead to that point (possibly; I am not
sure how that would work with an XML parser).
2. Implement some sort of tuple buffering, whereby the reading thread
reads multiple documents at a time and buffers them, then makeDocument
can consume the buffer and only has to wait/exit when the buffer is
empty. The producer thread could just work to fill the buffer at all
times unless it receives a quit message.
3. Split the large XML file into X smaller files and run them
independently. Thus, if you have 4 threads, split the file into 4
files and treat them separately. This is an easier-to-get-right
version of #1.
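
As a rough illustration of option 2, the buffering could be sketched
with a bounded java.util.concurrent.BlockingQueue. This is a sketch
under assumptions, not a patch: BufferedTupleSource, its next() and
quit() methods, the String[] tuple type, and the Iterator standing in
for the XML parser are all hypothetical names chosen for the example.

```java
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of option 2; all names here are illustrative only. */
class BufferedTupleSource {
    private static final String[] END = new String[0]; // sentinel: no more tuples

    private final BlockingQueue<String[]> buffer;
    private final Thread producer;

    BufferedTupleSource(Iterator<String[]> parsed, int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        this.producer = new Thread(() -> {
            try {
                while (parsed.hasNext()) {
                    buffer.put(parsed.next()); // blocks only while the buffer is full
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                // quit() was called: drop what is buffered and signal the end
                buffer.clear();
                buffer.offer(END);
            }
        }, "tuple-producer");
        producer.setDaemon(true);
        producer.start();
    }

    /** Returns the next tuple, or null once the input is exhausted or the caller is interrupted. */
    String[] next() {
        try {
            String[] tuple = buffer.take(); // waits only when the buffer is empty
            if (tuple == END) {
                buffer.put(END); // re-publish so other indexing threads also see the end
                return null;
            }
            return tuple;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** The "quit message": interrupts the producer thread. */
    void quit() {
        producer.interrupt();
    }
}
```

The capacity bounds memory use while letting the producer run ahead of
the indexing threads. Note that this only helps if the consumers are
stalling on the handoff; if parsing itself is the limit, the buffer
will simply stay empty and options 1 or 3 would be needed.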
Thoughts?
-Grant