Hi All,

I am new to Nutch, and it is really wonderful. I am stuck with the headings
plugin. My nutch-site.xml reads as follows:
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>Airpush Spider</value>
    </property>

    <property>
        <name>plugin.includes</name>
        <value>headings|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        <description>activates metatag parsing</description>
    </property>

    <property>
        <name>headings</name>
        <value>h1,h2</value>
        <description>Comma separated list of headings to retrieve from the document</description>
    </property>

    <property>
        <name>headings.multivalued</name>
        <value>false</value>
        <description>Whether to support multivalued headings.</description>
    </property>
</configuration>


When I dump data from the segments, I get the entire HTML content. Shouldn't
the dump contain only the headings read during crawling? Why am I getting the
entire data?

Please help me. Thanks in advance.

Regards,
K Krishna Chaitanya
