Re: Filtering content before parsing?

Rafit Izhak_Ratzin Tue, 24 Jan 2006 16:05:56 -0800

Hi,

I would like to filter all the MSWord and PDF files after fetching and before parsing.

Is there a way to do that?


Thanks,
Rafit

From: Gal Nitzan <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [email protected]
Subject: Re: Filtering content before parsing?
Date: Tue, 24 Jan 2006 21:55:57 +0200

Create your own parser.

Create a new plugin, parse-chunwei :) for example (just copy the whole
parse-html plugin and take it from there).

Subclass HtmlParser and there you have it.

G.

P.S. Do not forget to replace parse-html with your new plugin in the
plugin.includes entry in nutch-site.xml.
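
To make the suggestion a bit more concrete, here is a minimal sketch of the
filtering step such a subclass could apply to the fetched HTML before handing
it on to the stock parsing code. It is deliberately independent of the Nutch
API: the class name FormStripper and the method stripForms are hypothetical,
and the wiring into HtmlParser (reading the content bytes, choosing the right
character encoding, calling the parent parser) is not shown because it depends
on the Nutch version in use. A regular expression is only an approximation of
HTML parsing, but it is usually enough to drop a well-formed form block.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes <form>...</form> blocks
// from an HTML string so their option values never reach the text extractor.
final class FormStripper {

    // Case-insensitive; DOTALL lets '.' span line breaks; the reluctant
    // quantifier stops at the first closing tag, so separate forms are
    // removed one by one rather than everything between the first and last.
    private static final Pattern FORM_BLOCK = Pattern.compile(
            "<form\\b[^>]*>.*?</form\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private FormStripper() {
        // static utility, no instances
    }

    // Returns the input with every form block removed.
    static String stripForms(String html) {
        return FORM_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>Info</p>\n"
                + "<FORM action=\"/search\"><select>\n"
                + "<option>Alpha</option></select></FORM>\n"
                + "<p>More info</p>";
        System.out.println(stripForms(page));
    }
}
```

In a parser subclass, the idea would be to decode the fetched bytes, pass the
string through stripForms, and give the cleaned markup to the parent parser,
so that parse.getText() no longer contains the form's option values.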

On Tue, 2006-01-24 at 10:16 +0800, Chun Wei Ho wrote:

> Is there an easy way to filter content after fetching but before parsing?
>
> I'm crawling a site where the information pages include a form on the
> side, and the option values of the form (which also get sucked into
> the parse.getText() value that I index as "content") are interfering
> with searches on the index. I plan to filter the content and remove
> the form html block before parsing (as per above question). Does
> anyone have another method around this?
>
> Thanks
> CW
>

