Hi,
I would like to filter out all the MSWord and PDF files after fetching and before parsing.
Is there a way to do that?

Thanks,
Rafit


From: Gal Nitzan <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [email protected]
Subject: Re: Filtering content before parsing?
Date: Tue, 24 Jan 2006 21:55:57 +0200

Create your own parser.

Create a new plugin, parse-chunwei :) for example (just copy the whole
parse-html plugin and take it from there).

Subclass HtmlParser and there you have it.

G.

P.S. Do not forget to replace parse-html with your new plugin in the
plugin.includes entry in nutch-site.xml.
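
For what it's worth, the pre-processing step itself could look roughly like the sketch below. It only shows the "strip the form block out of the raw HTML" part; FormStripper and stripForms are made-up names, not anything in Nutch, and the regex approach is an assumption that will not cope with nested or badly malformed forms. How you call it from your HtmlParser subclass before handing the content on depends on the Nutch version you are running.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes <form>...</form> blocks
// from raw HTML so their option values never reach the parsed text.
public class FormStripper {

    // Matches an opening <form ...> tag through the nearest closing </form>,
    // ignoring case and letting '.' span line breaks.
    private static final Pattern FORM_BLOCK = Pattern.compile(
            "<form\\b[^>]*>.*?</form\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the input HTML with every matched form block removed.
    public static String stripForms(String html) {
        return FORM_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>Info</p>"
                + "<FORM action=\"/s\"><select><option>A</option></select></form>"
                + "<p>More</p>";
        System.out.println(stripForms(page));
    }
}
```

You would run the fetched page through stripForms first and then pass the result to the normal HTML parsing code, so parse.getText() no longer contains the option values.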


On Tue, 2006-01-24 at 10:16 +0800, Chun Wei Ho wrote:
> Is there an easy way to filter content after fetching but before parsing?
>
> I'm crawling a site where the information pages include a form on the
> side, and the option values of the form (which also get sucked into
> the parse.getText() value that I index as "content") are interfering
> with searches on the index. I plan to filter the content and remove
> the form html block before parsing (as per above question). Does
> anyone have another method around this?
>
> Thanks
> CW
>


