Philip Brown wrote:
> Andrzej Bialecki wrote:
>> Philip Brown wrote:
>>> Is it possible on some pages to crawl only between tags or have it
>>> not crawl between tags.
>>>
>>> ie.
>>>
>>> <nocrawl>blah blah blah</nocrawl>
>>> <crawlhere>the content only that I want to crawl</crawlhere>
>>> <nocrawl>blah blah blah</nocrawl>
>>>
>>> appreciate any input
>>> kind regards
>>
>> You can modify DOMContentUtils.java (found in parse-html plugin) to
>> implement this restriction.
>>
>
> Andrzej,
> thanks, i've had a look at DOMContentUtils.java file and it would take
> me a while to figure it out. however, I thought about putting in the
> conf/regex-normalizer.xml
>
> <regex>
> <pattern>(<donotcrawl>)(.^$*)(</donotcrawl>)</pattern>
> <substitution></substitution>
> </regex>
>
> would I need: &lt; - &lt; in the patterns?
>
> i've tried this to no success at this time. any suggestions.
>
> kind regards,
>
> Phil
>

Ha, after some time trying with the conf/regex-normalizer.xml file... I see that it is for URLs.
I would appreciate any pointers on DOMContentUtils.java.

kind regards,
Philip Brown

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
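
As a pointer for the DOMContentUtils.java question above, here is a minimal sketch of the general idea: walk the parsed DOM and skip any subtree rooted at a marker tag while collecting text. This is not Nutch's actual code. The class name TagFilteredText, the methods collectText and extract, and the constant SKIP_TAG are hypothetical names made up for illustration, and the demo uses the JDK's XML parser on a well-formed snippet rather than the HTML parser that the parse-html plugin uses. The same check would presumably go wherever the plugin's text extraction visits each node.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TagFilteredText {

    // Hypothetical marker tag taken from the example in the thread;
    // it is not a standard HTML element or a Nutch setting.
    static final String SKIP_TAG = "nocrawl";

    /** Appends the text under node to sb, skipping subtrees rooted at SKIP_TAG. */
    static void collectText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && SKIP_TAG.equalsIgnoreCase(node.getNodeName())) {
            return; // do not descend into excluded content
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // separate text runs with a single space
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), sb);
        }
    }

    /** Parses a well-formed XML/XHTML string and returns its filtered text. */
    static String extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder sb = new StringBuilder();
        collectText(doc.getDocumentElement(), sb);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<body><nocrawl>blah blah blah</nocrawl>"
                + "<crawlhere>the content only that I want to crawl</crawlhere>"
                + "<nocrawl>blah blah blah</nocrawl></body>";
        // prints: the content only that I want to crawl
        System.out.println(extract(page));
    }
}
```

To index only the text between <crawlhere> tags instead, the check could be inverted: start collecting when a crawlhere element is entered and ignore text everywhere else.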
