Hello,

Is it possible that Nutch (0.7.1) stops looking for URLs in an HTML file because of an error in the file? I have this impression, but I don't know how to test it to be sure.
Here is what I have done:
- the file is 34 KB (so there is no content-length limit)
- there are approximately 100 links in it
- only the first 54 are identified, and none of the following ones
- however, no error is reported by Nutch
- the regexp-urlfilter file only contains this line: +.

I was wondering if it was the structure of the links themselves, but when I put the same links in another file they were identified fine. The file contains quite a lot of JavaScript. If Nutch does indeed stop parsing, does it report the error somewhere?
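To rule out the markup of the links themselves, one quick baseline I could try is to count the <a href> tags in the raw file outside Nutch and compare that number with the 54 links Nutch reports. A minimal sketch (the CountLinks class name and the regular expression are just illustrative; it assumes the file is local and the links are ordinary <a href="..."> anchors, so it will not see anything built by the JavaScript):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts <a href="..."> anchors in a raw HTML file so the total can be
// compared with the number of links Nutch extracts. Sanity check only:
// it ignores malformed tags and links generated by JavaScript.
public class CountLinks {
    public static void main(String[] args) throws Exception {
        String html = new String(Files.readAllBytes(Paths.get(args[0])));
        Pattern anchor = Pattern.compile(
                "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
                Pattern.CASE_INSENSITIVE);
        Matcher m = anchor.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
            System.out.println(count + ": " + m.group(1));
        }
        System.out.println("total <a href> anchors found: " + count);
    }
}

If that count comes out around 100 while Nutch still stops at 54, I suppose that would point at the HTML parser rather than at the URL filter, but I may be missing something.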
Thanks,
Fr.