Philip Brown wrote:
Andrzej Bialecki wrote:
Philip Brown wrote:
Is it possible, on some pages, to crawl only the text between certain tags, or to have the crawler skip the text between certain tags?

e.g.

<nocrawl>blah blah blah</nocrawl>
<crawlhere>the content only that I want to crawl</crawlhere>
<nocrawl>blah blah blah</nocrawl>

appreciate any input
kind regards

You can modify DOMContentUtils.java (found in the parse-html plugin) to implement this restriction.


Andrzej,
Thanks. I've had a look at the DOMContentUtils.java file, and it would take me a while to figure it out. However, I thought about putting this in conf/regex-normalizer.xml:

<regex>
<pattern>(&lt;donotcrawl&gt;)(.^$*)(&lt;/donotcrawl&gt;)</pattern>
<substitution></substitution>
</regex>

Would I need: &amp;lt; - &amp;lt;  in the patterns?

I've tried this with no success so far. Any suggestions?

kind regards,

Phil






Ha, after some time trying with the conf/regex-normalizer.xml file, I see that it is for URLs.
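
Apart from that file being applied to URLs rather than page content, the pattern itself looks like a problem: as far as I can tell, (.^$*) cannot match, because ^ asks for the start of the input after a character has already been consumed. A rough sketch of a pattern that does strip such a section from a raw HTML string, in plain Java, is below. The class and method names are made up for illustration, and where such a step would be hooked into Nutch is not something this sketch answers.

```java
import java.util.regex.Pattern;

public class StripMarkedSections {
    // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at the
    // nearest closing tag, so two separate blocks are not merged into one.
    static final Pattern DO_NOT_CRAWL = Pattern.compile(
            "<donotcrawl>.*?</donotcrawl>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes every <donotcrawl>...</donotcrawl> block.
    static String strip(String html) {
        return DO_NOT_CRAWL.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "keep <donotcrawl>blah\nblah</donotcrawl>this"
                + "<donotcrawl>blah</donotcrawl> text";
        System.out.println(strip(page)); // prints "keep this text"
    }
}
```

Inside an XML config file the angle brackets would have to be written as &lt; and &gt;, as in the attempt above; in Java source they can be written literally.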

I would appreciate any pointers on DOMContentUtils.java.

kind regards,
Philip Brown
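
As a starting point for the kind of change Andrzej suggests, here is a minimal sketch of the general idea: walk the DOM and skip any subtree rooted at a marker element while collecting text. This is not the actual DOMContentUtils code. It uses only the standard org.w3c.dom API, the class and method names are hypothetical, the nocrawl tag is the made-up marker from the example above, and it assumes well-formed input because it uses the JDK's XML parser rather than the HTML parser Nutch uses.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagFilteredText {
    // Hypothetical marker tag; not something Nutch defines.
    static final String SKIP_TAG = "nocrawl";

    // Recursively appends text to 'out', ignoring any SKIP_TAG subtree.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && SKIP_TAG.equalsIgnoreCase(node.getNodeName())) {
            return; // drop this element and everything below it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Parses a well-formed document and returns the text outside SKIP_TAG.
    static String extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<body><nocrawl>blah blah blah</nocrawl>"
                + "<crawlhere>the content only that I want to crawl</crawlhere>"
                + "<nocrawl>blah blah blah</nocrawl></body>";
        System.out.println(extract(page));
        // prints "the content only that I want to crawl"
    }
}
```

The same check on the element name could presumably be added wherever the plugin's text-extraction code descends into child nodes, but that would need to be confirmed against the actual source.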
