Should I remove text|html from parse-(text|html|msword) and
leave only parse-(msword)? I am only interested in msword
files on that site.
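
For reference, the change suggested earlier in the thread would look
roughly like this in nutch-site.xml. This is only a sketch: the property
is copied from the nutch-default.xml entry quoted below, with msword
added to the parse group and the description shortened. It has not been
checked against a running Nutch 0.6 install.

```xml
<!-- Sketch of an override for nutch-site.xml.
     Assumption: the name and value are copied from the nutch-default.xml
     entry quoted in this thread; only "msword" has been added. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  </description>
</property>
```

The property would go inside the root element that nutch-site.xml
already contains. Whether text and html can be dropped from the parse
group without affecting how the .doc files are reached is not settled
in this thread.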


On Mon, 28 Mar 2005 23:52:19 -0800 (PST), thomas delnoij
<[EMAIL PROTECTED]> wrote:
> What version of Nutch are you using? I am using Nutch
> 0.6 and the nutch-default.xml file contains the
> following three entries related to plugins:
> 
> <property>
>   <name>plugin.folders</name>
>   <value>plugins</value>
>   <description>Directories where nutch plugins are located.  Each
>   element may be a relative or absolute path.  If absolute, it is used
>   as is.  If relative, it is searched for on the classpath.</description>
> </property>
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.  By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> 
> <property>
>   <name>plugin.excludes</name>
>   <value></value>
>   <description>Regular expression naming plugin directory names to exclude.
>   </description>
> </property>
> 
> Rgrds, Thomas
> 
> 
> --- Eric Money <[EMAIL PROTECTED]> wrote:
> > Thanks, I looked in the nutch-default.xml and found
> > the following property:
> >
> > <property>
> >   <name>plugin.folder</name>
> >   <value>plugins</value>
> >   <description>A Directory where nutch plugin are located</description>
> > </property>
> >
> > That is the only thing related to plugins; I did not find the
> > "parse-(text|html)" value.
> >
> > Also, should I include the following property:
> >
> > <property>
> >   <name>urlfilter.regex.file</name>
> >   <value>regex-urlfilter.txt</value>
> >   <description>Name of file on CLASSPATH containing default regular
> >   expressions used by RegexURLFilter.</description>
> > </property>
> >
> >
> > Thanks for your advice.
> >
> >
> > On Mon, 28 Mar 2005 11:56:00 -0800 (PST), thomas delnoij
> > <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I am new to Nutch as well, so please correct me if I
> > > am wrong.
> > >
> > > > Thanks. Could you please be more specific about how to
> > > > set up the url filter?
> > >
> > > The url filter should be set up in the
> > > regex-urlfilter.txt file. As far as I can tell, urls
> > > ending with the .doc extension are included.
> > >
> > > The word parser is installed by updating the
> > > nutch-site.xml file. You need to copy the entries from
> > > nutch-default.xml that you would like to change.
> > >
> > > In your case, I think you need to copy the
> > > plugin.includes property, and change parse-(text|html)
> > > to parse-(text|html|msword).
> > >
> > > Hope this helps.
> > >
> > > Rgrds,
> > >
> > > Thomas
> > >
> > >
> > > > Something like http://mysite.doc? But how can I get
> > > > all doc files at mysite if a doc is at
> > > > http://mysite/1/2/~user/a.doc?
> > > >
> > > > Is there any reference for the word parser? I don't know
> > > > how to use it, thank you.
> > > >
> > > >
> > > > On Mon, 28 Mar 2005 14:59:57 +0200, Stefan Groschupf
> > > > <[EMAIL PROTECTED]> wrote:
> > > > > Set up a url filter for any *.doc and install and use the
> > > > > word parser, that is all you need to do...
> > > > >
> > > > > On 28.03.2005 at 07:12, Eric Money wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > If I want to search a site but am only interested in the
> > > > > > files with the .doc suffix, how should I re-write nutch to
> > > > > > get all these files? Any comments and experiences
> > > > > > are appreciated, thanks all in advance.
> > > > > >
> > > > > ---------------------------------------------------------------
> > > > > company:        http://www.media-style.com
> > > > > forum:          http://www.text-mining.org
> > > > > blog:           http://www.find23.net