[General] Webboard: Antispam algorithm

2013-12-11 Thread bar
Author: fasfuuiios
Email: 
Message:
Currently there seems to be no way to stop spammed sites from being 
indexed. Link spammers even spam this board automatically from time to 
time. Their software is very pluggable and can be adapted to any type 
of CMS or submission form. 

I thought about a global, dirty solution that could hunt spam during 
the indexing process. Here is the idea.

-

Say we have a new option for versions 3.4+:

ExternalLinkCount [maxlinks] [maxpages] [nofollow]

maxlinks is the limit on external links per page. (Spammers try to add 
direct links to gain PageRank etc.)

maxpages is the limit on probably spammed pages on the same host.

nofollow is true or false. It controls whether links with rel=nofollow 
are counted too (true) or only direct links without it (false).
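
To make the per-page part of the idea concrete, here is a rough sketch 
in Python (not mnogosearch code). It counts links that point to another 
host and optionally skips rel=nofollow links. The names 
ExternalLinkCounter, count_external_links and count_nofollow are 
hypothetical, and treating "external" as "different hostname" is my 
assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ExternalLinkCounter(HTMLParser):
    """Counts <a href> links that point to a host other than the page's own."""

    def __init__(self, page_url, count_nofollow=True):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).hostname
        # Assumption: True counts links with and without rel=nofollow,
        # False counts only direct (followed) links.
        self.count_nofollow = count_nofollow
        self.external_links = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Resolve relative links against the page URL before comparing hosts.
        target = urlparse(urljoin(self.page_url, href))
        if target.scheme not in ("http", "https"):
            return  # skip mailto:, javascript: and similar
        if target.hostname == self.page_host:
            return  # internal link
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel and not self.count_nofollow:
            return
        self.external_links += 1


def count_external_links(html, page_url, count_nofollow=True):
    """Return the number of external links found in one HTML page."""
    parser = ExternalLinkCounter(page_url, count_nofollow)
    parser.feed(html)
    parser.close()
    return parser.external_links
```

A page would then be treated as probably spammed when 
count_external_links(...) is greater than maxlinks.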

---

Examples:

ExternalLinkCount 20

This will delete any page that has more than 20 external links.

ExternalLinkCount 20 20

This will automatically ban and remove a site that has more than 20 
pages, each with more than 20 external links.

ExternalLinkCount 20 20 true

This will do the previous thing, counting links both with and without 
nofollow.

ExternalLinkCount 20 20 false

This counts only direct links (without nofollow), which play with 
PageRank etc.
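
The host-level rule from the examples above could be sketched like 
this in Python (again not mnogosearch code). It assumes the external 
link count of each page is already known. The function name 
find_hosts_to_ban and the input format are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def find_hosts_to_ban(page_link_counts, maxlinks, maxpages):
    """Return hosts with more than maxpages pages over the maxlinks limit.

    page_link_counts is an iterable of (page_url, external_link_count)
    pairs; this input format is an assumption made for the sketch.
    """
    suspicious_pages = Counter()
    for url, n_links in page_link_counts:
        if n_links > maxlinks:
            suspicious_pages[urlparse(url).hostname] += 1
    return {host for host, n in suspicious_pages.items() if n > maxpages}
```

Both comparisons are strict ("more than"), as in the examples, so a 
host with exactly maxpages suspicious pages is not banned yet.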

---

This is not ideal. It can cut normal pages. But webmasters who use 
nofollow as Google recommends are rather safe. It can also cut blog 
pages with tons of good comments.
Big scientific pages, catalogs and wikis are probably not safe from 
such dirty filtering. 

Anyway, this is probably the simplest way to catch sites that have 
tons of spammed pages. With high limits it could probably help.



Here is an example of a site that is currently under spam attack. It 
generates thousands of such spammed pages. That is why I thought about 
this problem in a very basic but cruel way.

http://www.gksbeton.ru/index.php/peremychki-pb/item/35-novost-1/35-novost-1?start=400


Reply: http://www.mnogosearch.org/board/message.php?id=21609

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Antispam algorithm

2013-12-11 Thread bar
Author: fasfuuiios
Email: 
Message:
I'm not completely sure that it's a good idea, but it is probably 
better than nothing at all to stop this. Of course, it needs tests and 
analysis. I believe that a normal HTML page has no more than 5 external 
links. Currently even paid links are usually limited to 3, and they are 
placed inside the article to avoid Google filter penalties etc. 

Reply: http://www.mnogosearch.org/board/message.php?id=21610



[General] Webboard: cpu usage

2013-12-11 Thread bar
Author: fasfuuiios
Email: 
Message:
I have noticed that even if I start indexer with 5, 10, 20 or 40 
threads using the CrawlerThreads option in indexer.conf, the top 
command never shows more than 40% CPU, and only very rarely does it 
rise to 55%. With more threads it can slightly DDoS some sites, and 
they return error 503 or even 508. Server monitoring with munin shows 
rather stable performance without high CPU or memory usage during 
indexing. Sometimes indexer hangs, but I check it with cron every 
minute and start it again if it is not running.

* * * * *   root pgrep indexer > /dev/null || /usr/local/mnogosearch/sbin/indexer -l

Does mnogosearch have some internal performance limits on indexer to 
allow searching and indexing in parallel? Or maybe I have missed 
something in the compile options, or some special option in 
indexer.conf? I have not experimented with more than one indexer 
process. Is it possible to sustain 80% CPU usage constantly? If yes, 
what is the safest and most stable way to do it if the server is used 
only for indexing?

Or maybe it is good practice to limit indexer? I have seen PHP 
crawlers that can easily eat 90% of CPU. Of course, their slow 
performance cannot be compared with the high speed of mnogosearch. It 
works very fast. Still, it would be interesting to know how to load 
the server completely during indexing.

 

Reply: http://www.mnogosearch.org/board/message.php?id=21611
