I find that 'cached.jsp' executes any scripts that were cached along
with the pages. This is not wrong in itself, but it becomes a security
concern when the Nutch search engine is part of a website that
implements authentication and authorization.

If the original page contains a malicious script, that script runs in
the visitor's browser when the visitor opens the corresponding cached
page in the Nutch search engine. If the script is a cookie stealer, it
allows the attacker to steal the session cookies of an authenticated
user and hijack that user's session.

For this reason, search engines like Google, Yahoo, etc. serve their
cache from a different address, so that cached scripts cannot steal
the cookies set by domains like google.com, yahoo.com, etc. The same
practice should be followed with Nutch if the website it is hosted on
sets such sensitive cookies.
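
To illustrate why a separate address helps, here is a minimal sketch
of cookie domain matching. It uses java.net.HttpCookie.domainMatches
from the JDK, which follows RFC 2965 rules and only approximates what
browsers actually do. The host names and the cookie domain are
hypothetical, not taken from any real deployment.

```java
import java.net.HttpCookie;

public class CookieScopeSketch {

    // Returns true if a cookie scoped to cookieDomain would also be sent to
    // host, which means scripts served from that host could read it.
    public static boolean cookieVisibleTo(String cookieDomain, String host) {
        return HttpCookie.domainMatches(cookieDomain, host);
    }

    public static void main(String[] args) {
        // Hypothetical domain used by the site for its session cookie.
        String cookieDomain = ".example.com";

        // Cached pages served from the main site fall inside the cookie scope.
        System.out.println(cookieVisibleTo(cookieDomain, "www.example.com"));

        // A subdomain of the same site is still inside the scope.
        System.out.println(cookieVisibleTo(cookieDomain, "cache.example.com"));

        // A host under a different domain is outside the scope.
        System.out.println(cookieVisibleTo(cookieDomain, "cache.example.net"));
    }
}
```

Under these assumptions, only the last host keeps cached scripts away
from the session cookie, so moving the cache to a subdomain would not
be enough when the cookie is set for the whole domain.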

I am not sure whether it is possible to extract only the cached
content from the crawl DB and move it to a different server. So,
currently, the only method I can think of is the following:

1. Delete 'cached.jsp' from $CATALINA_HOME/webapps/ROOT.
2. Copy the 'crawl' DB to a different server.
3. Modify 'search.jsp' so that the 'Cached' link points to
'cached.jsp' on the other server.
4. Run two instances of the Tomcat server with Nutch: one for the
search web GUI and the other for 'cached.jsp' only.

Is there a better way to achieve this? If not, shouldn't the link to
'cached.jsp' be made configurable? I would appreciate it if someone
could suggest something regarding this issue.
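
A configurable link might look roughly like the sketch below. This is
only an illustration of the idea, not existing Nutch code: the
property name 'searcher.cached.base.url', the helper class, and the
'idx' and 'id' query parameters are assumptions made for the example.
It uses plain java.util.Properties rather than any Nutch
configuration API.

```java
import java.util.Properties;

public class CachedLinkSketch {

    // Hypothetical property name; Nutch does not define it.
    public static final String BASE_URL_KEY = "searcher.cached.base.url";

    // Builds the href for the 'Cached' link. When the property is unset or
    // empty, the link stays relative, so the page behaves as it does today.
    public static String cachedLink(Properties conf, int idx, int id) {
        String base = conf.getProperty(BASE_URL_KEY, "").trim();

        // Drop trailing slashes so the result never contains "//cached.jsp".
        while (base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }

        String prefix = base.isEmpty() ? "" : base + "/";
        // The idx and id parameters are assumed names for the hit identifiers.
        return prefix + "cached.jsp?idx=" + idx + "&id=" + id;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(cachedLink(conf, 0, 42));

        // Hypothetical second server that hosts only 'cached.jsp'.
        conf.setProperty(BASE_URL_KEY, "http://cache.example.net/");
        System.out.println(cachedLink(conf, 0, 42));
    }
}
```

With something like this, 'search.jsp' would call the helper instead
of hard-coding the link, and step 3 above would become a
configuration change.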

Regards,
Susam Pal
http://susam.in/
