Author: Alexander Barkov
Email:
Message:
Hi Fabien,

> Hi all,
>
> Is it possible to parallelize the indexing work on multiple machines ?
> I mean for example one global data server with the mnogosearch database, and
> a few server instances all running the indexer process, pointing to the same
> db server.
>
> The idea behind that question would be to create server instances working in
> parallel and therefore speed up the whole indexing work, and scale up with
> more servers when needed.
>
> Fabien.

mnoGoSearch supports multiple levels of parallelism.

1. indexer can run parallel threads for crawling and, starting from 3.4.0,
for indexing:

# Run 10 crawling threads
indexer -N10

# Run 10 indexing threads
indexer -N10 --index

2. It's possible to run multiple crawling processes on the same machine.
Just start "indexer" multiple times. This is very similar to "indexer -N10",
but if one process crashes for some reason (e.g. a bug), the other parallel
processes will safely continue to crawl.

Note, this works only for crawling! It's not possible to run multiple
indexing processes ("indexer --index") on the same database at the same time.

3. For crawling purposes, it's possible to use #1 and #2 at the same time.
Just start "indexer -Nxxx" multiple times. For example, if you start
"indexer -N10" ten times, you'll effectively get 100 crawling threads.

4. It is possible to run indexer in crawling mode on multiple machines
at the same time. This is very similar to #2, but you start indexer
on different machines. I think this is exactly what you're asking for.

To start using this, just copy indexer.conf to each machine and make sure
to fix DBAddr so that it points to the same database machine (e.g. change
localhost to the actual IP address of the database machine).
No other actions are needed.

Note, you can run multiple crawling processes on multiple machines,
and every process can use multiple threads. For example, you can start
"indexer -N10" ten times on each of ten machines and you'll effectively
get 1000 crawling threads.

Note, you can use combinations of the above ways. For example:
- Machine A can run an individual single-thread crawler
- Machine B can run multiple single-thread crawlers
- Machine C can run an individual multi-thread crawler
- Machine D can run multiple multi-thread crawlers
All at the same time, with the same database!

Just make sure to have a very fast database server. Consider using faster
disks (e.g. SSD and/or RAID) and more RAM to help the database server cache
as many index pages as possible.

At some point (when running a few dozen or a few hundred threads in total)
you'll reach heavy contention on the database, so the crawler threads will
spend their time waiting for the database to serve them.
But there is still a workaround. See #5.

5. And finally, it's possible to distribute data between multiple databases,
for even more parallelism. This mode needs some extra configuration.
Please see here for details:
http://www.mnogosearch.org/doc34/msearch-cluster.html

Note, the cluster nodes can reside:
- on the same physical machine, with multiple database servers
  each using its own physical hard disk
- or on different physical machines.

Reply: <http://www.mnogosearch.org/board/message.php?id=21797>
_______________________________________________
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general
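
As a rough sketch of options 3 and 4 above, the following shell script starts several crawling processes with several threads each. Only the "indexer" command and its -N option come from the message above; the variable names (INDEXER, PROCS, THREADS, DRY_RUN) and the dry-run default are assumptions made for illustration and are not mnoGoSearch settings.

```shell
#!/bin/sh
# Sketch: start PROCS crawling processes, each with THREADS crawling threads.
# All variable names below are hypothetical helpers, not mnoGoSearch options.
INDEXER="${INDEXER:-indexer}"   # path to the indexer binary (assumed to be in PATH)
PROCS="${PROCS:-10}"            # number of crawling processes on this machine
THREADS="${THREADS:-10}"        # threads per process, passed as -N
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands, 0 = really start them

start_crawlers() {
    i=1
    while [ "$i" -le "$PROCS" ]; do
        if [ "$DRY_RUN" = "1" ]; then
            # Show what would be started, without touching the database.
            echo "$INDEXER -N$THREADS"
        else
            # Start one crawling process in the background.
            "$INDEXER" "-N$THREADS" &
        fi
        i=$((i + 1))
    done
    # Wait for all background crawlers (returns immediately in dry-run mode).
    wait
    echo "total crawling threads: $((PROCS * THREADS))"
}

start_crawlers
```

With the defaults the script only prints the ten "indexer -N10" commands it would run. Setting DRY_RUN=0 starts them for real; running the same script on every crawling machine, each with an indexer.conf whose DBAddr points to the shared database server, gives the multi-machine setup from option 4. It deliberately does not pass --index, since only one indexing process may work on a database at a time.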