parse-(html) does follow links of full html page, parse-(tika) does not follow any links and stops at level 1
-------------------------------------------------------------------------------------------------------------

Key: NUTCH-817
URL: https://issues.apache.org/jira/browse/NUTCH-817
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.1
Environment: Suse linux 11.1, java version "1.6.0_13"
Reporter: matthew a. grisius

Submitted per Julien Nioche. I did not see where to attach a file, so I pasted it here.
BTW: the Tika command line returns an empty HTML body for this file.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<!--NewPage-->
<HTML>
<HEAD>
<!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->
<TITLE>
Matrix Application Development Kit
</TITLE>
<SCRIPT type="text/javascript">
    targetPage = "" + window.location.search;
    if (targetPage != "" && targetPage != "undefined")
        targetPage = targetPage.substring(1);
    function loadFrames() {
        if (targetPage != "" && targetPage != "undefined")
            top.classFrame.location = top.targetPage;
    }
</SCRIPT>
<NOSCRIPT>
</NOSCRIPT>
</HEAD>
<FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
<FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
<FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
<FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and interfaces (except non-static nested types)">
</FRAMESET>
<FRAME src="overview-summary.html" name="classFrame" title="Package, class and interface descriptions" scrolling="yes">
<NOFRAMES>
<H2>
Frame Alert</H2>
<P>
This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web client.
<BR>
Link to<A HREF="overview-summary.html">Non-frame version.</A>
</NOFRAMES>
</FRAMESET>
</HTML>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
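
For reference, the outlinks one would expect a parser to find in the pasted frameset page can be listed with a small standalone script. This is only a sketch using Python's standard-library html.parser; it is not Nutch or Tika code and makes no claim about how either parser works internally. The names FrameLinkCollector, extract_frame_links and SAMPLE are hypothetical, and SAMPLE is a shortened copy of the frameset portion of the file above.

```python
from html.parser import HTMLParser


class FrameLinkCollector(HTMLParser):
    """Collects the src targets of <frame> and <iframe> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <FRAME SRC=...> matches too.
        if tag not in ("frame", "iframe"):
            return
        for name, value in attrs:
            # Keep each target once, in document order.
            if name == "src" and value and value not in self.links:
                self.links.append(value)


def extract_frame_links(html):
    """Return the frame targets found in an HTML string."""
    collector = FrameLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Shortened copy of the frameset portion of the pasted file.
SAMPLE = """
<FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
<FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
<FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
<FRAME src="allclasses-frame.html" name="packageFrame" title="All classes">
</FRAMESET>
<FRAME src="overview-summary.html" name="classFrame" scrolling="yes">
</FRAMESET>
"""

if __name__ == "__main__":
    for link in extract_frame_links(SAMPLE):
        print(link)
```

Run against the sample, the script prints the three frame targets (overview-frame.html, allclasses-frame.html, overview-summary.html). The link inside NOFRAMES points at overview-summary.html as well, so it adds no further target.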