Philip Brown wrote:
> Andrzej Bialecki wrote:
>> Philip Brown wrote:
>>> Is it possible, on some pages, to crawl only between certain tags, or 
>>> to have it skip the content between certain tags?
>>>
>>> e.g.
>>>
>>> <nocrawl>blah blah blah</nocrawl>
>>> <crawlhere>the content only that I want to crawl</crawlhere>
>>> <nocrawl>blah blah blah</nocrawl>
>>>
>>> appreciate any input
>>> kind regards
>>
>> You can modify DOMContentUtils.java (found in parse-html plugin) to 
>> implement this restriction.
>>
>
> Andrzej,
> thanks. I've had a look at the DOMContentUtils.java file and it would take 
> me a while to figure it out. However, I thought about putting this in 
> conf/regex-normalizer.xml:
>
> <regex>
> <pattern>(&lt;donotcrawl&gt;)(.^$*)(&lt;/donotcrawl&gt;)</pattern>
> <substitution></substitution>
> </regex>
>
> Would I need: &amp;lt; - &amp;lt; in the patterns?
>
> I've tried this without success so far. Any suggestions?
>
> kind regards,
>
> Phil
Ha, after some time trying with the conf/regex-normalizer.xml file... I 
see that it is for URLs, not for page content.
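
For what it's worth, the pattern quoted above would not match a tagged block even if it were applied to page text: it needs a non-greedy, dot-matches-newline form. The following is only a sketch in plain Java, not a Nutch configuration; the class name StripTagsSketch and the donotcrawl tag are assumptions taken from the example in this thread.

```java
import java.util.regex.Pattern;

class StripTagsSketch {
    // (?i) ignores case, (?s) lets '.' match newlines,
    // and '.*?' is non-greedy so each tagged block is removed on its own.
    static final Pattern DO_NOT_CRAWL =
            Pattern.compile("(?is)<donotcrawl>.*?</donotcrawl>");

    // Returns the input with every <donotcrawl>...</donotcrawl> block removed.
    static String strip(String html) {
        return DO_NOT_CRAWL.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "a<donotcrawl>x\ny</donotcrawl>b<donotcrawl>z</donotcrawl>c";
        System.out.println(strip(page)); // prints "abc"
    }
}
```

Running a regex over raw HTML is fragile (nested or malformed tags will defeat it), which is presumably why working on the parsed DOM was suggested instead.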

I would appreciate any pointers on DOMContentUtils.java.

kind regards,
Philip Brown
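
As a starting point for the DOM-based approach suggested above, here is a minimal, standalone sketch of a tree walk that collects text while skipping a chosen element. It is not the actual code of DOMContentUtils.java: the class TagFilterSketch, its methods, and the nocrawl tag name are all hypothetical, and the demo parses well-formed markup with the JDK's XML parser rather than the HTML parser used by the parse-html plugin.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagFilterSketch {

    // Hypothetical tag name, taken from the example in this thread.
    static final String SKIP_TAG = "nocrawl";

    // Recursively append text, skipping any subtree rooted at a SKIP_TAG element.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && SKIP_TAG.equalsIgnoreCase(node.getNodeName())) {
            return; // ignore this element and everything inside it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Demo helper (assumption): parses well-formed markup with the JDK XML parser.
    static String extractText(String markup) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            collectText(root, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<body><nocrawl>blah blah blah</nocrawl>"
                + "<crawlhere>the content only that I want to crawl</crawlhere>"
                + "<nocrawl>blah blah blah</nocrawl></body>";
        System.out.println(extractText(page));
    }
}
```

The same early-return check could presumably be placed wherever the plugin walks the DOM to gather text, so that a skipped element never contributes to the indexed content.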

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general