Hi,

I am new to Nutch as well, so please correct me if I
am wrong.

> Thanks. Could you please be more specific, how to
> setup the url filter?

The url filter is set up in the regex-urlfilter.txt
file. As far as I can tell, the default rules there
already accept urls ending with the .doc extension.

The word parser is installed by updating the
nutch-site.xml file. You need to copy into it the
entries from nutch-default.xml that you would like to
change.

In your case, I think you need to copy the
plugin.includes property, and change parse-(text|html)
to parse-(text|html|msword).
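
As a sketch only, the entry in nutch-site.xml might then
look roughly like this. I am assuming the <nutch-conf>
root element and the other plugin names in the value;
please keep whatever your own nutch-default.xml has and
only change the parse-(...) part:

```xml
<?xml version="1.0"?>
<!-- Assumption: use the same root element as your nutch-default.xml -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- Only parse-(text|html|msword) is the actual change; the
         rest of this value is an assumed example, copy yours instead -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
  </property>
</nutch-conf>
```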

Hope this helps.

Rgrds,

Thomas


> something like http://mysite.doc? But how can I get
> all doc files at mysite
> if the doc is at http://mysite/1/2/~user/a.doc.
> 
> Is there any reference for word parser? I don't know
> how to use it, thank you.
> 
> 
> On Mon, 28 Mar 2005 14:59:57 +0200, Stefan Groschupf
> <[EMAIL PROTECTED]> wrote:
> > Setup a url filter for any *.doc and install and use the word parser,
> > that is all you need to do...
> > 
> > On 28.03.2005 at 07:12, Eric Money wrote:
> > 
> > > Hi all,
> > >
> > > If I wanna search a site but only interested in the
> > > files with .doc suffix, how should I re-write nutch to
> > > get all these files? Any comments and experiences
> > > are appreciated, thanks all in advance.
> >
> > ---------------------------------------------------------------
> > company:                http://www.media-style.com
> > forum:          http://www.text-mining.org
> > blog:                   http://www.find23.net
> > 
> >
> 
