RE: [Nutch-general] MS Doc Filter

Chirag Chaman Mon, 28 Mar 2005 14:47:29 -0800

Eric:

The plugins are loaded dynamically. "plugins" is the top-level directory
where the plugins are located.


Yes, you should include the urlfilter.regex.file property.

It names the file that lists the URLs you want to include or ignore.
Here's the format:

# This is a comment
# this will ignore URLs that have .gif in them
-\.gif
# this will include all URLs that contain nutch.org
+nutch.org

So, make sure that .doc is not on a "-" line.
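
To illustrate how I understand these rules to be applied, here is a rough
sketch in Java. It is not Nutch's actual filter code: the class name
UrlFilterSketch and the method accepts are made up for this example. It
assumes that lines are tried top to bottom, that the first pattern found
anywhere in the URL decides, and that a URL matching no line is ignored.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Nutch's own filter class.
class UrlFilterSketch {

    // Returns true if the URL would be included under the assumed rules:
    // lines are tried top to bottom, the first regex found anywhere in
    // the URL decides, and a URL matching no line is ignored.
    static boolean accepts(List<String> ruleLines, String url) {
        for (String line : ruleLines) {
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue; // blank line or comment
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException(
                    "Invalid first character: " + line);
            }
            // Everything after the sign is treated as the regex.
            Pattern pattern = Pattern.compile(line.substring(1));
            if (pattern.matcher(url).find()) {
                return sign == '+';
            }
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList(
            "# This is a comment",
            "-\\.gif",
            "+nutch.org");
        String[] urls = {
            "http://www.nutch.org/docs/a.doc",
            "http://www.nutch.org/logo.gif",
            "http://example.com/a.doc"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepts(rules, url));
        }
    }
}
```

With the two rules from the example, a .doc URL on nutch.org passes, a
.gif URL is ignored, and a URL on another host is ignored because no line
matches it.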


 

-----Original Message-----
From: Eric Money [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 28, 2005 5:31 PM
To: [email protected]
Subject: Re: [Nutch-general] MS Doc Filter

Thanks, I looked in nutch-default.xml and found the following property:

<property>
  <name>plugin.folder</name>
  <value>plugins</value>
  <description>A Directory where nutch plugin are located</description>
</property>

which is the only thing related to plugins, but I did not find the
"parse-(text|html)" value.

Also, should I include the following property?

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing default regular
  expressions used by RegexURLFilter.</description>
</property>


Thanks for your advice.


On Mon, 28 Mar 2005 11:56:00 -0800 (PST), thomas delnoij <[EMAIL PROTECTED]>
wrote:
> Hi,
> 
> I am new to Nutch as well, so please correct me if I am wrong.
> 
> > Thanks. Could you please be more specific, how to setup the url 
> > filter?
> 
> The url filter should be set up in the regex-urlfilter.txt file. As 
> far as I can tell, urls ending with the .doc extension are included.
> 
> The word parser is installed by updating the nutch-site.xml file. You 
> need to copy the entries from nutch-default.xml that you would like to 
> change.
> 
> In your case, I think you need to copy the plugin.includes property, 
> and change parse-(text|html) to parse-(text|html|msword).
> 
> Hope this helps.
> 
> Rgrds,
> 
> Thomas
> 
> 
> > something like http://mysite.doc? But how can I get all doc files at 
> > mysite if the doc is at http://mysite/1/2/~user/a.doc.
> >
> > Is there any reference for word parser? I don't know how to use it, 
> > thank you.
> >
> >
> > On Mon, 28 Mar 2005 14:59:57 +0200, Stefan Groschupf 
> > <[EMAIL PROTECTED]> wrote:
> > > Set up a url filter for any *.doc and install and use the word
> > > parser, that is all you need to do...
> > >
> > > On 28.03.2005 at 07:12, Eric Money wrote:
> > >
> > > > Hi all,
> > > >
> > > > If I wanna search a site but only interested in the
> > > > files with .doc suffix, how should I re-write nutch to
> > > > get all these files? Any comments and experiences
> > > > are appreciated, thanks all in advance.
> > > >
> > >
> > > ---------------------------------------------------------------
> > > company:                http://www.media-style.com
> > > forum:          http://www.text-mining.org
> > > blog:                   http://www.find23.net
> > >
> > >
> >
>
