Hi All,
I am new to Nutch, and it is really wonderful. I am stuck with the headings
plugin. My nutch-site.xml reads as follows:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Airpush Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>headings|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>activates metatag parsing</description>
  </property>
  <property>
    <name>headings</name>
    <value>h1,h2</value>
    <description>Comma separated list of headings to retrieve from the document</description>
  </property>
  <property>
    <name>headings.multivalued</name>
    <value>false</value>
    <description>Whether to support multivalued headings.</description>
  </property>
</configuration>
When I dump the data from the segments, I get the entire HTML content.
Shouldn't the dump contain only the headings extracted during the crawl? Why
am I getting all of the data?
Please help me. Thanks in advance.
Regards,
K Krishna Chaitanya