Author: Alexander Barkov
Hi Fabien,

> Hi all,
> Is it possible to parallelize the indexing work on multiple machines ?
> I mean for example one global data server with the mnogosearch database, and 
> a few server instances all running the indexer process, pointing to the same 
> db server.
> The idea behind that question would be to create server instances working in 
> parallel and therefore speed up the whole indexing work, and scale up with 
> more servers when needed.
> Fabien.

mnoGoSearch supports multiple levels of parallelism.

1. indexer can run parallel threads for crawling and, starting from 3.4.0, for
indexing:

# Run 10 crawling threads
indexer -N10

# Run 10 indexing threads
indexer -N10 --index

2. It's possible to run multiple crawling processes on the same machine. Just
start "indexer" multiple times.
This is very similar to "indexer -N10", but if one process
crashes for some reason (e.g. a bug), the other
parallel processes will safely continue to crawl.

Note, this works only for crawling! It's not possible to run
multiple indexing processes ("indexer --index") on the same
database at the same time.

3. For crawling purposes, it's possible to use #1 and #2 at the same time. Just 
start "indexer -Nxxx" multiple times.
For example, if you start "indexer -N10" ten times,
you'll effectively get 100 crawling threads.
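
A rough sketch of how #2 and #3 could be scripted. Note that start_crawlers and INDEXER_CMD are hypothetical names made up for this example, not part of mnoGoSearch; only "indexer -N<threads>" comes from the description above.

```shell
# Hypothetical helper: start COUNT indexer processes in the background,
# each one running THREADS crawling threads.
# INDEXER_CMD is an assumed override so the function can be dry-run
# without a real indexer binary; it defaults to "indexer".
start_crawlers() {
    count=$1
    threads=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        "${INDEXER_CMD:-indexer}" "-N${threads}" &
        i=$((i + 1))
    done
    # Block until every background crawler has exited.
    # Each crawler is a separate process, so one crashing
    # does not take the others down.
    wait
}

# Usage: ten processes x ten threads = 100 crawling threads
# start_crawlers 10 10
```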

4. It is possible to run indexer in crawling mode on
multiple machines at the same time. This is very similar to #2,
except that you start indexer on different machines.

I think this is exactly what you're asking for.

To start using this, just copy indexer.conf to multiple machines
and make sure to fix DBAddr to point to the same database machine
(e.g. change localhost to the actual IP address of the database machine).
No other action is needed.
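
As an illustration only, the DBAddr change could be done with sed. The sample DBAddr line below (user, password, database name) and the address 192.0.2.10 are placeholders assumed for this sketch; check your own indexer.conf for the real value.

```shell
# Hypothetical sample config line; credentials and database name are placeholders.
conf=$(mktemp)
printf '%s\n' 'DBAddr mysql://user:password@localhost/mnogosearch/' > "$conf"

# Assumed address of the shared database machine.
db_host=192.0.2.10

# Replace the "localhost" host part of the DBAddr line with the real address.
new_conf=$(sed "s#^\(DBAddr .*@\)localhost/#\1${db_host}/#" "$conf")
printf '%s\n' "$new_conf"

rm -f "$conf"
```

Here the result is only printed; on a real crawler machine you would write it back to that machine's copy of indexer.conf.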

Note, you can run multiple crawling processes on multiple machines,
and every process can use multiple threads.

For example, you can start "indexer -N10" ten times on each of ten machines
and you'll effectively get 1000 crawling threads.
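
The arithmetic behind that figure, as a quick sketch (the variable names are illustrative only, not mnoGoSearch settings):

```shell
machines=10            # machines running indexer
procs_per_machine=10   # how many times "indexer -N10" is started on each machine
threads_per_proc=10    # the value passed to -N

# Total crawling threads hitting the same database.
total=$((machines * procs_per_machine * threads_per_proc))
echo "total crawling threads: $total"
```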

Note, you can use any combination of the above.
For example:
- Machine A can run a single single-threaded crawler
- Machine B can run multiple single-threaded crawlers
- Machine C can run a single multi-threaded crawler
- Machine D can run multiple multi-threaded crawlers

All at the same time, with the same database!
Just make sure to have a very fast database server.
Consider using faster (e.g. SSD and/or RAID) disks and more
RAM to help the database server cache as many index pages
as possible.

At some point (when running a few dozen or a few hundred threads in total)
you'll hit heavy thread contention, so the crawler threads
will be waiting for the database to serve them. But there is still
a workaround. See #5.

5. And finally, it's possible to distribute data between multiple
databases, for even more parallelism. This mode needs some extra
configuration. Please see here for details:

Note, the cluster nodes can reside:
- on the same physical machine, with multiple database servers each using its
own physical hard disk
- or on different physical machines.
