Mike Smith wrote:
I finally found out why this problem happens: there seems to be a problem with
the JS parser. I used this:
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
instead of the default one, which has JS in it, and I could fetch
http://www.globalmedlaw.com/Canadam.html at depth 2. But when I use
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
the reduce fails at the end of fetching. I came up with this solution because
that page uses a redirected JS page to provide some dynamic content, and
after removing the JS plugin it worked fine. Now I am going to run a larger
crawl over 100,000 seed URLs to see if this really solved the problem.
Have you had any problems with the JS parser?
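
To make the difference between the two plugin.includes values concrete, here is a small Python sketch that applies both regular expressions to a list of plugin ids. The id list is illustrative only, and the whole-id match is an assumption about how the pattern is applied, not a statement about Nutch internals.

```python
import re

# The two plugin.includes values quoted in the message above.
WITHOUT_JS = r"protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)"
WITH_JS = r"protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)"

# Illustrative plugin ids (assumption): the real set depends on the installation.
PLUGIN_IDS = [
    "protocol-http", "urlfilter-regex", "parse-text", "parse-html", "parse-js",
    "index-basic", "query-basic", "query-site", "query-url",
]


def enabled(pattern, plugin_ids=PLUGIN_IDS):
    """Return the ids matched in full by the pattern (assumed semantics)."""
    regex = re.compile(pattern)
    return [pid for pid in plugin_ids if regex.fullmatch(pid)]


def difference():
    """Return the ids enabled by WITH_JS but not by WITHOUT_JS."""
    return sorted(set(enabled(WITH_JS)) - set(enabled(WITHOUT_JS)))


if __name__ == "__main__":
    print(difference())
```

Under these assumptions the only plugin that differs between the two settings is the JS parser, which is consistent with the observation that removing it changes the outcome of the fetch.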
That's an interesting observation. Could you check which exception
(if any) the JS parser throws when it fails? It may have been
written to one of the tasktracker logs.
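
One way to look for such an exception is to scan the log files for exception lines whose following lines mention the JS parser. The Python sketch below is only a starting point: the log directory, the "*.log" file pattern, and the "parse.js" marker string are all assumptions and should be adjusted to the actual setup.

```python
import re
from pathlib import Path

# Assumptions: exception lines contain "Exception" or "Error", and stack
# frames from the JS parser contain "parse.js" (a guess at its package name).
EXCEPTION_RE = re.compile(r"Exception|Error")
JS_HINT = "parse.js"


def find_js_exceptions(log_dir, context=5):
    """Yield (path, line number, snippet) for exception lines that are
    followed within `context` lines by a mention of the JS parser."""
    for path in sorted(Path(log_dir).rglob("*.log")):
        lines = path.read_text(errors="replace").splitlines()
        for i, line in enumerate(lines):
            if not EXCEPTION_RE.search(line):
                continue
            snippet = lines[i:i + context + 1]
            if any(JS_HINT in s for s in snippet):
                yield path, i + 1, "\n".join(snippet)


if __name__ == "__main__":
    # "logs" is a hypothetical directory name; point it at the tasktracker logs.
    for path, lineno, snippet in find_js_exceptions("logs"):
        print(f"{path}:{lineno}\n{snippet}\n")
```

The context window keeps the first few stack frames together with the exception line, which is usually enough to tell whether the JS parser is involved.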
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com