Sorry, I meant that
I would like to filter all the MSWord and PDF files after fetching and
before parsing.
Thanks,
Rafit
From: "Rafit Izhak_Ratzin" <[EMAIL PROTECTED]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Filtering content before parsing?
Date: Wed, 25 Jan 2006 00:05:18 +0000
Hi,
I would like to filter all the MSWord and PDF files after fetching and before
filtering.
Is there a way to do that?
Thanks,
Rafit
From: Gal Nitzan <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [email protected]
Subject: Re: Filtering content before parsing?
Date: Tue, 24 Jan 2006 21:55:57 +0200
Create your own parser.
Create a new plugin, parse-chunwei :) for example (just copy the whole
parse-html plugin and take it from there),
subclass HtmlParser, and there you have it.
G.
P.S. Do not forget to replace parse-html with your plugin in the
plugin.includes entry in nutch-site.xml.
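For illustration only, here is a minimal sketch of the kind of pre-processing
step such a custom parser could apply to the raw HTML before handing it to the
normal parsing code. FormStripper and stripForms are hypothetical names, not
part of Nutch, and how the result is wired into the copied plugin is left
open. It assumes the form markup is well formed and not nested; a regex is a
shortcut here, and a DOM-based removal would be more robust.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Nutch class): removes <form> blocks from raw HTML. */
public final class FormStripper {

    // (?i) makes the match case-insensitive; (?s) lets '.' match line breaks.
    // The reluctant ".*?" stops at the first closing tag, so each form is
    // removed on its own instead of swallowing the text between two forms.
    private static final Pattern FORM_BLOCK =
            Pattern.compile("(?is)<form\\b[^>]*>.*?</form\\s*>");

    private FormStripper() {}

    /** Returns the HTML with every <form>...</form> block removed. */
    public static String stripForms(String html) {
        return FORM_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Article text</p>"
                + "<FORM action=\"/search\"><select><option>Alpha</option></select></FORM>"
                + "<p>More text</p></body></html>";
        // Prints the page without the form and its option values.
        System.out.println(stripForms(page));
    }
}
```

With the form removed before parsing, the option values should no longer end
up in the text returned by parse.getText().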
On Tue, 2006-01-24 at 10:16 +0800, Chun Wei Ho wrote:
> Is there an easy way to filter content after fetching but before
> parsing?
>
> I'm crawling a site where the information pages include a form on the
> side, and the option values of the form (which also get sucked into
> the parse.getText() value that I index as "content") are interfering
> with searches on the index. I plan to filter the content and remove
> the form HTML block before parsing (as per the above question). Does
> anyone have another way around this?
>
> Thanks
> CW
>