Should I remove text|html from parse-(text|html|msword) and leave only parse-(msword), since I am only interested in msword files on that site?
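
For reference, the override suggested earlier in this thread would look roughly like the fragment below in nutch-site.xml. It is only a sketch: it assumes the Nutch 0.6 default value of plugin.includes quoted in this thread, and that the fragment is placed inside the existing root element of nutch-site.xml. Whether html can be dropped is still open; one assumption worth checking is that the crawler relies on the HTML parser to extract the links that lead to the .doc files, in which case parse-(msword) alone might never reach them.

```xml
<!-- Sketch for nutch-site.xml: overrides the plugin.includes default
     copied from nutch-default.xml (Nutch 0.6 value assumed). -->
<property>
  <name>plugin.includes</name>
  <!-- Only change from the default: msword added to the parse group. -->
  <value>protocol-http|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  </description>
</property>
```

The value is a regular expression matched against plugin directory names, so parse-(text|html|msword) selects the parse-text, parse-html and parse-msword directories.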

On Mon, 28 Mar 2005 23:52:19 -0800 (PST), thomas delnoij <[EMAIL PROTECTED]> wrote:
> What version of Nutch are you using? I am using Nutch 0.6 and the
> nutch-default.xml file contains the following three entries related
> to plugins:
>
> <property>
>   <name>plugin.folders</name>
>   <value>plugins</value>
>   <description>Directories where nutch plugins are located. Each
>   element may be a relative or absolute path. If absolute, it is used
>   as is. If relative, it is searched for on the classpath.</description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
>
> <property>
>   <name>plugin.excludes</name>
>   <value></value>
>   <description>Regular expression naming plugin directory names to exclude.
>   </description>
> </property>
>
> Rgrds, Thomas
>
> --- Eric Money <[EMAIL PROTECTED]> wrote:
> > Thanks, I looked in the nutch-default.xml and found the following
> > property:
> >
> > <property>
> >   <name>plugin.folder</name>
> >   <value>plugins</value>
> >   <description>A Directory where nutch plugin are located</description>
> > </property>
> >
> > which is the only thing related to plugins, but I did not find the
> > "parse-(text|html)" value.
> >
> > Also, should I include the following property:
> >
> > <property>
> >   <name>urlfilter.regex.file</name>
> >   <value>regex-urlfilter.txt</value>
> >   <description>Name of file on CLASSPATH containing default regular
> >   expressions used by RegexURLFilter.</description>
> > </property>
> >
> > Thanks for your advice.
> >
> > On Mon, 28 Mar 2005 11:56:00 -0800 (PST), thomas delnoij <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I am new to Nutch as well, so please correct me if I am wrong.
> > >
> > > > Thanks. Could you please be more specific about how to set up
> > > > the url filter?
> > >
> > > The url filter should be set up in the regex-urlfilter.txt file.
> > > As far as I can tell, urls ending with the .doc extension are
> > > included.
> > >
> > > The word parser is installed by updating the nutch-site.xml file.
> > > You need to copy the entries from nutch-default.xml that you would
> > > like to change.
> > >
> > > In your case, I think you need to copy the plugin.includes
> > > property, and change parse-(text|html) to parse-(text|html|msword).
> > >
> > > Hope this helps.
> > >
> > > Rgrds,
> > >
> > > Thomas
> > >
> > > > something like http://mysite.doc? But how can I get all doc
> > > > files at mysite if the doc is at http://mysite/1/2/~user/a.doc.
> > > >
> > > > Is there any reference for word parser? I don't know how to use
> > > > it, thank you.
> > > >
> > > > On Mon, 28 Mar 2005 14:59:57 +0200, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > > > > Set up a url filter for any *.doc and install and use the word
> > > > > parser, that is all you need to do...
> > > > >
> > > > > On 28.03.2005 at 07:12, Eric Money wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > If I wanna search a site but am only interested in the files
> > > > > > with .doc suffix, how should I re-write nutch to get all
> > > > > > these files? Any comments and experiences are appreciated,
> > > > > > thanks all in advance.
