Re: Only indexing pages meeting certain criteria

Marcin Okraszewski Thu, 08 Oct 2009 15:18:12 -0700

The modified command is mergesegs. It has a "-filter" option, which filters
the segments being merged. Stock Nutch filters only by URL. With the patch
there is an additional extension point that allows filtering by the
content of any part of the segment being merged.


http://wiki.apache.org/nutch/bin/nutch_mergesegs

If you list segments one by one, you need to provide at least two. There is
also a handy "-dir" option, which takes a directory whose subdirectories are
the segments to merge. If that directory contains just one segment, the
command will only filter, since there is nothing to merge :)
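
As a rough sketch of the "-dir" case (the function name, the NUTCH variable
and the paths are made up for illustration; the argument order follows the
mergesegs usage on the wiki page linked above, so check it against your
Nutch version):

```shell
# Sketch only: filter_segments and NUTCH are illustrative names, not part of Nutch.
# NUTCH points at the launcher script; it defaults to bin/nutch in the Nutch home.
filter_segments() {
    segments_dir=$1   # directory whose subdirs are the segments
    output_dir=$2     # directory where the merged/filtered segment is written
    # -dir picks up every segment under segments_dir; -filter applies the filters
    "${NUTCH:-bin/nutch}" mergesegs "$output_dir" -dir "$segments_dir" -filter
}
```

Called as "filter_segments crawl/segments crawl/segments_filtered", this
should write a new filtered segment under the output directory, which you
can then index instead of the original one.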

Marcin


On Thu, Oct 8, 2009 at 10:31 PM, BELLINI ADAM <mbel...@msn.com> wrote:

>
> Marcin,
>
> can you tell us how you deleted URLs from segments before merging them?
> I mean, how did you filter the segment?
>
>
>
> > From: okrasz_n...@o2.pl
> > Date: Thu, 8 Oct 2009 22:18:50 +0200
> > Subject: Re: Only indexing pages meeting certain criteria
> > To: nutch-user@lucene.apache.org
> >
> > I have achieved it by filtering the segment before indexing. You can do
> > it while merging segments ... actually you can merge just one segment,
> > so then you simply filter. If you are able to filter by URL, you can do
> > it right away with Nutch 0.9 or 1.0. If you need page content or some
> > metadata extracted during parsing, you would need to apply this patch
> >
> > https://issues.apache.org/jira/browse/NUTCH-677
> >
> > and provide your filtering extension.
> >
> > Regards,
> > Marcin
> >
> > 2009/10/8 Magnús Skúlason <magg...@gmail.com>
> >
> > > Hi,
> > > I want Nutch to index only some of the documents that it crawls. I have
> > > tried what is suggested here:
> > > http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11649.html
> > >
> > > That is, in an IndexingFilter I check the condition for whether to
> > > index the document, and if not I return null.
> > >
> > > When I then run the crawl I get the following error:
> > > Exception in thread "main" java.io.IOException: Job failed!
> > >        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> > >        at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
> > >        at org.apache.nutch.crawl.Crawl.main(Crawl.java:134)
> > >
> > > I am on Nutch 0.9, a few months older than the date in the original
> > > post. Does anyone know what I might be doing wrong, or why this is not
> > > working any more?
> > > If this has changed, can anyone tell me how I can do this?
> > >
> > > best regards,
> > > Magnus
> > >
>
