Hi, I am new to Nutch as well, so please correct me if I am wrong.
> Thanks. Could you please be more specific, how to
> setup the url filter?

The URL filter is set up in the regex-urlfilter.txt file. As far as I can tell, URLs ending with the .doc extension are included.

The Word parser is installed by updating the nutch-site.xml file. You need to copy the entries you would like to change from nutch-default.xml. In your case, I think you need to copy the plugin.includes property and change parse-(text|html) to parse-(text|html|msword).

Hope this helps.

Rgrds,
Thomas

> something like http://mysite.doc? But how can I get
> all doc files at mysite
> if the doc is at http://mysite/1/2/~user/a.doc.
>
> Is there any reference for word parser? I don't know
> how to use it, thank you.
>
>
> On Mon, 28 Mar 2005 14:59:57 +0200, Stefan Groschupf
> <[EMAIL PROTECTED]> wrote:
> > Setup a url filter for any *.doc and install and use the word parser,
> > that is all you need to do...
> >
> > On 28.03.2005 at 07:12, Eric Money wrote:
> >
> > > Hi all,
> > >
> > > If I wanna search a site but only interested in the
> > > files with .doc suffix, how should I re-write nutch to
> > > get all these files? Any comments and experiences
> > > are appreciated, thanks all in advance.
> >
> > ---------------------------------------------------------------
> > company: http://www.media-style.com
> > forum: http://www.text-mining.org
> > blog: http://www.find23.net
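
P.S. To check which URLs a set of filter rules would let through, here is a rough Python sketch of how I understand regex-urlfilter.txt to be applied: each rule is a "+" (accept) or "-" (reject) followed by a regular expression, the first matching rule decides, and a URL that matches no rule is rejected. The rules below are made up for illustration and are not copied from a Nutch distribution. The helper names (parse_rules, accepts) are my own, and Python's re module is not the regex engine Nutch uses, so treat this as a way to reason about patterns rather than as Nutch's actual behaviour.

```python
import re

# Hypothetical rules in the "+pattern / -pattern" style of regex-urlfilter.txt.
# Illustrative only: these are assumptions, not the rules shipped with Nutch.
RULES = r"""
# reject a few suffixes we do not want (note: .doc is NOT listed here)
-\.(gif|jpg|css|zip|exe)$
# accept everything else on the example host
+^http://mysite/
"""


def parse_rules(text):
    """Turn rule lines into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the decision of the first matching rule; reject if none match."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES)

for url in (
    "http://mysite/1/2/~user/a.doc",
    "http://mysite/index.html",
    "http://mysite/logo.gif",
    "http://other/a.doc",
):
    print(url, "accepted" if accepts(rules, url) else "rejected")
```

Note that the sketch keeps the HTML pages of the site accepted as well. A crawler only discovers a URL such as http://mysite/1/2/~user/a.doc by following links from pages it has fetched, so a filter that accepts nothing but *.doc would probably never reach those files.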
