Andrzej Bialecki wrote:
> Philip Brown wrote:
>> Is it possible on some pages to crawl only between tags or have it 
>> not crawl between tags.
>>
>> ie.
>>
>> <nocrawl>blah blah blah</nocrawl>
>> <crawlhere>the content only that I want to crawl</crawlhere>
>> <nocrawl>blah blah blah</nocrawl>
>>
>> appreciate any input
>> kind regards
>
> You can modify DOMContentUtils.java (found in parse-html plugin) to 
> implement this restriction.
>

Andrzej,
thanks. I've had a look at the DOMContentUtils.java file, and it would take 
me a while to figure it out. However, I thought about putting this in 
conf/regex-normalizer.xml:

<regex>
<pattern>(&lt;donotcrawl&gt;)(.^$*)(&lt;/donotcrawl&gt;)</pattern>
<substitution></substitution>
</regex>

Would I need: &amp;lt; - &amp;lt;  in the patterns?

I've tried this with no success so far. Any suggestions?
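
For reference, here is a minimal standalone sketch of the kind of stripping described above, written in plain Java rather than against any Nutch API. The class name DoNotCrawlStripper and the <donotcrawl> tag are purely illustrative assumptions. Two things it tries to show: a reluctant ".*?" with DOTALL is the usual way to match "anything up to the closing tag" (a group like "(.^$*)" cannot match arbitrary text), and the match has to run against the page content itself. As far as I understand, the rules in regex-normalizer.xml are applied to URLs, not to page content, so a filter like this would have to live in the parsing step instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes everything between
// <donotcrawl> and </donotcrawl> from a raw HTML string.
public class DoNotCrawlStripper {

    // ".*?" is reluctant, so each block ends at the nearest closing tag.
    // DOTALL lets '.' match newlines; CASE_INSENSITIVE tolerates <DoNotCrawl>.
    private static final Pattern BLOCK = Pattern.compile(
            "<donotcrawl>.*?</donotcrawl>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    public static String strip(String html) {
        return BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<donotcrawl>menu</donotcrawl>"
                + "keep this"
                + "<donotcrawl>footer\nlinks</donotcrawl>";
        System.out.println(strip(page)); // prints: keep this
    }
}
```

A regex over raw HTML is fragile (nested or unclosed tags will confuse it), which is one reason skipping the unwanted elements while walking the DOM in the parser is probably the more robust route.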

kind regards,

Phil




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
