Hi,

What you are asking for is not possible with the prune command. Prune removes
URLs that match a specific pattern specified by the administrator.

You will need to parse the HTML page so that the "unwanted" portions you
mention, i.e. <div class="menu"...>, are not included in the CONTENT field
of the index that gets created. I don't think this ability exists in
Nutch 0.9 (or even in 1.0), so you will need to write your own code for it.
We did that and it works well: only certain portions of the page are
included in the index.
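
To illustrate the idea, here is a minimal sketch of the filtering step. It is
not our Nutch code and does not use any Nutch API; it is plain Python using
only the standard library, and the names MenuStrippingExtractor and
extract_content are made up for this example. It assumes the unwanted block
is a <div> whose class attribute contains "menu" and that the markup is
reasonably well formed. The same logic would have to be ported into your own
parsing code in Nutch.

```python
from html.parser import HTMLParser


class MenuStrippingExtractor(HTMLParser):
    """Collects page text, skipping everything inside <div class="menu">.

    Hypothetical helper for illustration only; not part of Nutch.
    """

    def __init__(self, skip_class="menu"):
        super().__init__(convert_charrefs=True)
        self.skip_class = skip_class
        self.skip_depth = 0  # > 0 while we are inside an excluded div
        self.chunks = []     # text fragments that should be indexed

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.skip_depth:
            # Nested div inside the excluded block: track it so that the
            # matching </div> does not end the skip too early.
            self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_class in classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_content(html, skip_class="menu"):
    """Return the text of `html` without the content of the excluded div."""
    parser = MenuStrippingExtractor(skip_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<html><body>'
    '<div class="menu"><div>Home</div><a href="/x">Products</a></div>'
    '<div id="main"><p>Products we sell</p></div>'
    '</body></html>'
)
print(extract_content(page))
```

Only the text returned by extract_content would go into the CONTENT field, so
a keyword that appears only in the menu no longer matches every page.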

Thanks,
-sroy

On Mon, Nov 9, 2009 at 9:09 PM, Annappa <annappa...@yahoo.co.in> wrote:

>
> Hi,
>
> I am using Nutch-0.9 to crawl a web application which has a
> header part, menu part, left navigation and main content area.
>
> When I search for a particular keyword and it appears in the main
> menu, the results repeat once for every page, because the menu
> is included in all the pages. So I need to restrict my search so that it
> does not search the content of a particular div
>
> ex : <div class="menu"> ................   </div>.
>
>
> How do I remove the content inside a div from the search?
>
> --
> View this message in context:
> http://old.nabble.com/PRUNE-%3A-need-some-help-on-pruning-syntax.-tp26268447p26268447.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in
