Hi All,

 

 

OK, let's push something above average into this aged topic.

What about robots?

 

Recently I browsed Nutch (Yahoo has invested a lot in it in India…) and Bixo
(and I invested something myself…).

 

Both still have problems, especially if you are really trying to be a "polite
robot":



1. Don't follow http://www.robotstxt.org – it is a privately owned
website, the information there is at least 12 years out of date, and don't
forget to click the Google Ads.

2. There are some newer standards, and all of those standards were pushed…
by robots!

3. … A LOT!!!

 

I was thinking about coding style… projects such as Nutch, Solr, Bixo, and
Cascading use static classes, so the codebase seems very small; a single
class can do 10 times more than your 100 classes…

 

But I think it's better to improve the existing codebase instead of doing a
complete rewrite (and, even worse: copy-paste!).

 

Here are some improvements I am going to work on; tell me if you are
interested:

 

-          Spell check: tolerate common misspellings of directives, such as
user-agent, useragent, usreagent, …

-          UTF-8: according to the spec, the URL must be percent-decoded
before applying rules from robots.txt; additionally, %2F must not be
decoded!
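To illustrate the spell-check idea, here is a minimal sketch of a tolerant directive normalizer. All names are hypothetical (this is not Droids or Nutch code), and the list of typos is just an example set:

```java
import java.util.Locale;

public class DirectiveNormalizer {

    // Hypothetical helper: collapse common misspellings of robots.txt
    // directive names before dispatching to the rule parser.
    public static String normalize(String raw) {
        // Lowercase and drop spaces, hyphens, and underscores so that
        // "User-Agent", "useragent", and "user_agent" all compare equal.
        String key = raw.toLowerCase(Locale.ROOT).replaceAll("[\\s_-]", "");
        switch (key) {
            case "useragent":
            case "usreagent":      // transposition typo seen in the wild
                return "user-agent";
            case "disallow":
            case "dissallow":      // doubled-letter typo
                return "disallow";
            case "allow":
                return "allow";
            case "crawldelay":
                return "crawl-delay";
            case "sitemap":
                return "sitemap";
            default:
                return key;        // unknown directive: pass through
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("UserAgent"));   // user-agent
        System.out.println(normalize("usreagent"));   // user-agent
        System.out.println(normalize("Dissallow"));   // disallow
    }
}
```

A real implementation would probably want a configurable alias table rather than a hard-coded switch, but the point is that the parser should not silently drop a whole record because of one typo.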
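And a sketch of the UTF-8 point: percent-decode the path before matching, but leave %2F escaped, since an encoded slash is data rather than a path separator. The class and method names are mine, not from any of the projects above:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class RobotsPathDecoder {

    // Percent-decode a URL path (UTF-8) before matching robots.txt rules,
    // but keep %2F escaped: an encoded slash is data, not a separator.
    public static String decodeForMatching(String path) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()
                    && isHex(path.charAt(i + 1)) && isHex(path.charAt(i + 2))) {
                String hex = path.substring(i + 1, i + 3);
                if (hex.equalsIgnoreCase("2F")) {
                    flush(pending, out);
                    out.append("%2F");                        // leave as-is
                } else {
                    pending.write(Integer.parseInt(hex, 16)); // raw UTF-8 byte
                }
                i += 2;
            } else {
                flush(pending, out);
                out.append(c);
            }
        }
        flush(pending, out);
        return out.toString();
    }

    private static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    // Turn any accumulated raw bytes into a UTF-8 string; multi-byte
    // sequences such as %E2%82%AC are decoded as one character.
    private static void flush(ByteArrayOutputStream pending, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeForMatching("/a%7Eb"));     // /a~b
        System.out.println(decodeForMatching("/a%2Fb"));     // /a%2Fb
        System.out.println(decodeForMatching("/%E2%82%AC")); // /€
    }
}
```

This is exactly the case where a naive `URLDecoder.decode()` call goes wrong: it would turn /a%2Fb into /a/b and match the wrong rules.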

 

For instance, both Nutch and Bixo rely on Droids for this, but nothing
happens…

 

I think the framework should be clear enough that we can add new rules (such
as "recrawl rate" or "sitemap", or even custom domain-specific rules such as
the Nutch RegEx filter).
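Whatever coding style wins, the extension point could look roughly like this. Every name here is hypothetical (nothing below comes from Droids or Nutch): each rule votes or abstains on a path, so sitemap, recrawl-rate, or regex rules can be added without touching the core matcher:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a pluggable rule API for a robots framework.
public class RuleEngine {

    public enum Decision { ALLOW, DISALLOW, NEUTRAL }

    public interface RobotsRule {
        // Vote on a path, or return NEUTRAL to abstain.
        Decision evaluate(String path);
    }

    // Classic prefix rule from robots.txt.
    public static RobotsRule disallowPrefix(String prefix) {
        return path -> path.startsWith(prefix)
                ? Decision.DISALLOW : Decision.NEUTRAL;
    }

    // Domain-specific rule in the spirit of the Nutch regex URL filter.
    public static RobotsRule disallowRegex(String regex) {
        return path -> path.matches(regex)
                ? Decision.DISALLOW : Decision.NEUTRAL;
    }

    // First non-neutral vote wins; the default is ALLOW.
    public static Decision decide(List<RobotsRule> rules, String path) {
        for (RobotsRule rule : rules) {
            Decision d = rule.evaluate(path);
            if (d != Decision.NEUTRAL) {
                return d;
            }
        }
        return Decision.ALLOW;
    }

    public static void main(String[] args) {
        List<RobotsRule> rules = Arrays.asList(
                disallowPrefix("/private/"),
                disallowRegex(".*\\.(jpg|gif)$"));
        System.out.println(decide(rules, "/private/x"));  // DISALLOW
        System.out.println(decide(rules, "/img/a.jpg"));  // DISALLOW
        System.out.println(decide(rules, "/index.html")); // ALLOW
    }
}
```

The same shape works fine with static methods instead of an interface; the important property is that adding a rule type never requires editing the matcher loop.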

 

I want to push some code, but I think it's much better to follow the Nutch
coding style (local/static/private) instead of the extremely naïve
"interface & implementation" approach…

 

Thanks,

 

 

Fuad Efendi

+1 416-993-2060

http://www.linkedin.com/in/liferay

 

Tokenizer Inc.

http://www.tokenizer.ca/

Data Mining, Vertical Search

 

(sorry for Search Engine Optimization trick… but it is so popular here!)

 
