Re: Filtering content before parsing?

Rafit Izhak_Ratzin Tue, 24 Jan 2006 16:08:49 -0800

Sorry, I meant that

I would like to filter all the MSWord and PDF files after fetching and before parsing.


Thanks,
Rafit

From: "Rafit Izhak_Ratzin" <[EMAIL PROTECTED]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Filtering content before parsing?
Date: Wed, 25 Jan 2006 00:05:18 +0000

Hi,

I would like to filter all the MSWord and PDF files after fetching and before filtering.

Is there a way to do that?

Thanks,
Rafit

From: Gal Nitzan <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [email protected]
Subject: Re: Filtering content before parsing?
Date: Tue, 24 Jan 2006 21:55:57 +0200

Create your own parser.

Create a new plugin, parse-chunwei :) for example (just copy the whole
parse-html plugin and take it from there).

Subclass HtmlParser and there you have it.

G.

P.S. Do not forget to replace parse-html in the plugin.includes entry in
nutch-site.xml.
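
A minimal sketch of the stripping step such a custom parser might run on
the fetched HTML before handing it to the inherited HTML parsing logic.
FormStripper and stripForms are hypothetical names, not part of Nutch, and
the wiring into the plugin's parse method is left out. The regex approach is
an assumption: it handles simple, well-formed form blocks and will not cope
with nested or unclosed forms.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not Nutch API): strips form blocks from raw HTML. */
final class FormStripper {
    // (?is): match case-insensitively and let '.' span newlines.
    // The reluctant .*? stops at the first closing form tag.
    private static final Pattern FORM_BLOCK =
        Pattern.compile("(?is)<form\\b[^>]*>.*?</form\\s*>");

    /** Returns the HTML with every form block removed. */
    static String stripForms(String html) {
        return FORM_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Info</p>"
            + "<form action=\"/s\"><select><option>A</option></select></form>"
            + "</body></html>";
        // prints <html><body><p>Info</p></body></html>
        System.out.println(stripForms(page));
    }
}
```

A subclass of the HTML parser could apply stripForms to the decoded page
content and then pass the result on to the parent class, so the option
values never reach the text that gets indexed.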

On Tue, 2006-01-24 at 10:16 +0800, Chun Wei Ho wrote:

> Is there an easy way to filter content after fetching but before parsing?

>
> I'm crawling a site where the information pages include a form on the
> side, and the option values of the form (which also get sucked into
> the parse.getText() value that I index as "content") are interfering
> with searches on the index. I plan to filter the content and remove
> the form html block before parsing (as per above question). Does
> anyone have another method around this?
>
> Thanks
> CW
>


