Re: [Nutch-general] MS Doc Filter

thomas delnoij Mon, 28 Mar 2005 23:52:26 -0800

What version of Nutch are you using? I am using Nutch
0.6 and the nutch-default.xml file contains the
following three entries related to plugins:


<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.
  </description>
</property>
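
As a rough sketch of the override suggested in my earlier message quoted
below (copy plugin.includes into nutch-site.xml and add msword to the
parse group), the entry might look like the following. This assumes the
Word parser plugin directory is named parse-msword, which is what the
regular expression would match; the description text is my own wording,
and the element goes inside the existing root element of nutch-site.xml.

```xml
<!-- Sketch of an override for nutch-site.xml; place it inside the file's
     existing root element. Assumes the Word parser plugin directory is
     named parse-msword, which is what the regex below would match. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
  <description>Same as the default, with msword added to the parse
  plugins (illustrative description, not taken from nutch-default.xml).
  </description>
</property>
```

Only properties that differ from nutch-default.xml need to appear in
nutch-site.xml, so the other two plugin entries can stay as they are.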

Rgrds, Thomas




--- Eric Money <[EMAIL PROTECTED]> wrote:
> Thanks, I looked in the nutch-default.xml and found
> the following property:
> 
> <property>
>   <name>plugin.folder</name>
>   <value>plugins</value>
>   <description>A Directory where nutch plugin are located</description>
> </property>
> 
> which is the only thing related to plugins, but I
> did not find the "parse-(text|html)" value.
> 
> Also, should I include the following property:
> 
> <property>
>   <name>urlfilter.regex.file</name>
>   <value>regex-urlfilter.txt</value>
>   <description>Name of file on CLASSPATH containing default regular
>   expressions used by RegexURLFilter.</description>
> </property>
> 
> 
> Thanks for your advice.
> 
> 
> On Mon, 28 Mar 2005 11:56:00 -0800 (PST), thomas delnoij
> <[EMAIL PROTECTED]> wrote:
> > Hi,
> > 
> > I am new to Nutch as well, so please correct me if I
> > am wrong.
> > 
> > > Thanks. Could you please be more specific, how to
> > > setup the url filter?
> > 
> > The url filter should be set up in the
> > regex-urlfilter.txt file. As far as I can tell, urls
> > ending with the .doc extension are included.
> > 
> > The word parser is installed by updating the
> > nutch-site.xml file. You need to copy the entries from
> > nutch-default.xml that you would like to change.
> > 
> > In your case, I think you need to copy the
> > plugin.includes property, and change parse-(text|html)
> > to parse-(text|html|msword).
> > 
> > Hope this helps.
> > 
> > Rgrds,
> > 
> > Thomas
> > 
> > 
> > > something like http://mysite.doc? But how can I get
> > > all doc files at mysite
> > > if the doc is at http://mysite/1/2/~user/a.doc.
> > >
> > > Is there any reference for word parser? I don't know
> > > how to use it, thank you.
> > >
> > >
> > > On Mon, 28 Mar 2005 14:59:57 +0200, Stefan Groschupf
> > > <[EMAIL PROTECTED]> wrote:
> > > > Setup a url filter for any *.doc and install and
> > > > use the word parser,
> > > > that is all you need to do...
> > > >
> > > > Am 28.03.2005 um 07:12 schrieb Eric Money:
> > > >
> > > > > Hi all,
> > > > >
> > > > > If I wanna search a site but only interested in the
> > > > > files with .doc suffix, how should I re-write nutch to
> > > > > get all these files? Any comments and experiences
> > > > > are appreciated, thanks all in advance.
> > > > >
> > > > >
> > > > >
> > >
> >
>
-------------------------------------------------------
> > > > > SF email is sponsored by - The IT Product
> Guide
> > > > > Read honest & candid reviews on hundreds of
> IT
> > > Products from real
> > > > > users.
> > > > > Discover which products truly live up to the
> > > hype. Start reading now.
> > > > >
> > >
> >
>
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > > > >
> _______________________________________________
> > > > > Nutch-general mailing list
> > > > > [email protected]
> > > > >
> > >
> >
>
https://lists.sourceforge.net/lists/listinfo/nutch-general
> > > > >
> > > > >
> > > > ---------------------------------------------------------------
> > > > company:        http://www.media-style.com
> > > > forum:          http://www.text-mining.org
> > > > blog:           http://www.find23.net
