Andrzej Bialecki wrote:
> Philip Brown wrote:
>> Is it possible on some pages to crawl only between tags or have it
>> not crawl between tags.
>>
>> ie.
>>
>> <nocrawl>blah blah blah</nocrawl>
>> <crawlhere>the content only that I want to crawl</crawlhere>
>> <nocrawl>blah blah blah</nocrawl>
>>
>> appreciate any input
>> kind regards
>
> You can modify DOMContentUtils.java (found in parse-html plugin) to
> implement this restriction.
>
Andrzej,

Thanks. I've had a look at the DOMContentUtils.java file, and it would take me a while to figure it out. However, I thought about putting this in conf/regex-normalizer.xml:

<regex>
  <pattern>(<donotcrawl>)(.^$*)(</donotcrawl>)</pattern>
  <substitution></substitution>
</regex>

Would I need to escape < as &lt; in the patterns? I've tried this with no success so far. Any suggestions?

Kind regards,
Phil

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
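
For reference, here is a minimal sketch of the tag-stripping idea in plain Java, outside of Nutch. It is not the DOMContentUtils.java change suggested above. The class name StripNoCrawl and the method strip are hypothetical, and <donotcrawl> is simply the made-up tag from the question. Two points about the pattern as posted: ".^$*" cannot match the text between the tags, since "^" and "$" are anchors rather than wildcards, and the regex normalizer configuration is, as far as I know, applied to URLs rather than to page content, so a plain string filter like this would have to run on the HTML before parsing. In an XML config file a literal < would indeed need to be written as &lt;.

```java
import java.util.regex.Pattern;

public class StripNoCrawl {
    // (?is) turns on case-insensitive matching and DOTALL, so '.' also
    // matches newlines. The reluctant '.*?' stops at the first closing tag,
    // which keeps separate <donotcrawl> blocks from merging into one match.
    private static final Pattern NO_CRAWL =
        Pattern.compile("(?is)<donotcrawl>.*?</donotcrawl>");

    // Returns the input with every <donotcrawl>...</donotcrawl> span removed.
    static String strip(String html) {
        return NO_CRAWL.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<donotcrawl>menu\nlinks</donotcrawl>"
                    + "<p>the content I want</p>"
                    + "<donotcrawl>footer</donotcrawl>";
        System.out.println(strip(page)); // prints <p>the content I want</p>
    }
}
```

A regex like this does not handle nested tags or tags with attributes; walking the DOM, as Andrzej suggests, would be the more robust route.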
