Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I added those Disallow lines so that both apps can crawl the same 
> number of URLs (approximately 140): 
> 
> Disallow */basket-villeurbanne/author/*
> Disallow *?p=*
> Disallow */feed
> 
> It seems that mnoGoSearch can handle a robots.txt, but not the 
> meta robots noindex,follow directive....
> 


mnoGoSearch supports meta robots.
Can you please give a URL of a document whose robots directives are ignored? 
I'll check what's happening.


> Here are the results :
> 
> -----
> 
> indexer -C;
> indexer;
> 
> [18898]{01} Done (53 seconds, 168 documents, 3503752 bytes, 64.56 
> Kbytes/sec.)
> 
> ------
> 
> indexer -C;
> indexer -N5;
> 
> [19261]{02} Done (15 seconds, 46 documents, 982938 bytes, 63.99 
> Kbytes/sec.)
> [19261]{03} Done (15 seconds, 48 documents, 930200 bytes, 60.56 
> Kbytes/sec.)
> [19261]{01} Done (5 seconds, 14 documents, 323667 bytes, 63.22 
> Kbytes/sec.)
> [19261]{05} Done (15 seconds, 46 documents, 974427 bytes, 63.44 
> Kbytes/sec.)
> [19261]{04} Done (5 seconds, 14 documents, 292520 bytes, 57.13 
> Kbytes/sec.)
> [19261]{--} Done (26 seconds, 168 documents, 3503752 bytes, 131.60 
> Kbytes/sec.)
> 
> 
> indexer -C;
> indexer -N50;
> [20289]{11} Done (11 seconds, 28 documents, 585571 bytes, 51.99 
> Kbytes/sec.)
> [20289]{28} Done (11 seconds, 29 documents, 705247 bytes, 62.61 
> Kbytes/sec.)
> [20289]{16} Done (11 seconds, 30 documents, 635782 bytes, 56.44 
> Kbytes/sec.)
> [20289]{30} Done (11 seconds, 30 documents, 635178 bytes, 56.39 
> Kbytes/sec.)
> [20289]{--} Done (21 seconds, 168 documents, 3504392 bytes, 162.96 
> Kbytes/sec.)
> 
> 
> mysql -uroot -p -N --database=db_test_mnogo --execute="SELECT url 
> FROM url" > ~/ALL.txt;
> 
> (cat ~/ALL.txt | parallel -j8 --gnu "wget {}");
> 
> real  0m10.638s
> user  0m1.256s
> sys   0m1.519s
> 
> 
> ---
> 
> Screaming Frog : 12s
> 
> 
> This just confirms that mnoGoSearch is relatively slower than 
> Screaming Frog, and even compared to a parallel wget in bash, 
> mnoGoSearch is slower.
> 
> It gets a little better with indexer -N50, though.


Well, this effect can happen with a *small* site and an empty database.

When indexer starts multiple threads (say 10) and the database is empty, 9 
threads immediately go to sleep for 10 seconds.
So only the first thread is actually working.

After 10 seconds the database is no longer empty, because the first thread 
has collected some links.

So indexer actually starts working in multi-threaded mode only after 10 seconds.

With a bigger site you will not see any difference between mnoGoSearch
and wget or Screaming Frog.


If you really need to crawl a small site quickly,
please apply this patch:

<patch>
=== modified file 'src/indexer.c'
--- src/indexer.c       2016-03-30 12:13:49 +0000
+++ src/indexer.c       2016-05-14 08:28:25 +0000
@@ -2872,7 +2872,7 @@ int maxthreads=       1;
 UDM_CRAWLER *ThreadCrawlers= NULL;
 int thd_errors= 0;
 
-#define UDM_NOTARGETS_SLEEP 10
+#define UDM_NOTARGETS_SLEEP 0
 
 #ifdef  WIN32
 unsigned int __stdcall UdmCrawlerMain(void *arg)
</patch>


Here are the results:

./indexer -Cw ; ./indexer -N10
[5853]{--} Done (12 seconds, 168 documents, 3504192 bytes, 285.17 Kbytes/sec.)


It's now as fast as wget and Screaming Frog, and it crawls more documents (168 vs. 140).


Please note:
Aggressive crawling is not polite and can even be considered an 
attack. It is better not to crawl sites this way, unless it is your
own site, or the site owners allow you to do so.



Reply: <http://www.mnogosearch.org/board/message.php?id=21768>

_______________________________________________
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general
