Hello,
Is it possible that Nutch (0.7.1) stops looking for URLs in an HTML
file because of an error in the file? -- I have that impression but I
don't know how to test it to be sure.

Here is what I have done:
- the file is 34 kB (so it is well under the content-length limit)
- there are approx. 100 links in it
- but only the first 54 are identified, then none of the following ones
- however no error is reported by Nutch
- the regexp-urlfilter file only contains this line: +.

I was wondering if it was the structure of the links themselves, but I
tried putting them in another file and they were identified fine.

The file has quite a lot of javascript in it.
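As a sanity check outside Nutch, a deliberately lenient parser can count the links in the file independently, to confirm that all ~100 are really there and that malformed JavaScript between them does not hide the later ones. This is just a diagnostic sketch, not Nutch's own parser, and the inline snippet is a made-up example:

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Collects href values from <a> tags; keeps going past malformed markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def count_links(html):
    parser = LinkCounter()
    parser.feed(html)
    return len(parser.links)


# A hypothetical snippet with script content between two links --
# the lenient parser still reports both.
snippet = '<a href="/one">1</a><script>if(a<b)x();</script><a href="/two">2</a>'
print(count_links(snippet))  # -> 2
```

Running `count_links` over the real file (e.g. `count_links(open("page.html").read())`) and comparing the count with what Nutch reports would show whether the parser or the file is at fault.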

If Nutch indeed does stop parsing, does it report the error somewhere?

Thanks, Fr.


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
