parse-(html) does follow links of full html page, parse-(tika) does not follow any 
links and stops at level 1
--------------------------------------------------------------------------------------------------------

                 Key: NUTCH-817
                 URL: https://issues.apache.org/jira/browse/NUTCH-817
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.1
         Environment: Suse linux 11.1, java version "1.6.0_13"
            Reporter: matthew a. grisius


Submitted per Julien Nioche. I did not see where to attach a file, so I pasted 
it here. BTW: the Tika command line returns an empty HTML body for this file.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" 
"http://www.w3.org/TR/html4/frameset.dtd">

<!--NewPage-->

<HTML>

<HEAD>

<!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->

<TITLE>

Matrix Application Development Kit

</TITLE>

<SCRIPT type="text/javascript">

    targetPage = "" + window.location.search;

    if (targetPage != "" && targetPage != "undefined")

       targetPage = targetPage.substring(1);

    function loadFrames() {

        if (targetPage != "" && targetPage != "undefined")

             top.classFrame.location = top.targetPage;

    }

</SCRIPT>

<NOSCRIPT>

</NOSCRIPT>

</HEAD>

<FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">

<FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">

<FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">

<FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and 
interfaces (except non-static nested types)">

</FRAMESET>

<FRAME src="overview-summary.html" name="classFrame" title="Package, class and 
interface descriptions" scrolling="yes">

<NOFRAMES>

<H2>

Frame Alert</H2>



<P>

This document is designed to be viewed using the frames feature. If you see 
this message, you are using a non-frame-capable web client.

<BR>

Link to<A HREF="overview-summary.html">Non-frame version.</A>

</NOFRAMES>

</FRAMESET>

</HTML>
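
For reference, the links in the file above live in the FRAME src attributes (plus one A HREF inside NOFRAMES). The following is a minimal sketch, not the Nutch or Tika code path, of listing the FRAME targets with Python's standard html.parser. The names LinkCollector, extract_links and SAMPLE are made up for illustration, and SAMPLE is a trimmed copy of the frameset above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the src value of every FRAME tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.links.append(src)


def extract_links(html_text):
    """Return the FRAME src values found in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Trimmed copy of the frameset from the pasted file (illustration only).
SAMPLE = """
<FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
<FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
<FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
<FRAME src="allclasses-frame.html" name="packageFrame">
</FRAMESET>
<FRAME src="overview-summary.html" name="classFrame" scrolling="yes">
</FRAMESET>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE):
        print(link)
```

A parser that follows frames would be expected to report these targets as outlinks. Whether parse-tika should do so by default is the question raised in this issue, and the sketch does not settle it.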


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
