Okay, that sounds good. Two questions:
* If I don't want to index a document, what should I return from
BasicIndexingFilter.filter -- the document I receive, null, or something
else? (There's a rough sketch of what I'm picturing below.)
* What change(s) do I have to make to HtmlParser? It seems like I can use
the Parse object as-is, e.g. parse.getData().get("index") to get the
meta-data value for "index". What am I missing? (See the PS below.)
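For the first question, here's roughly what I'm picturing near the top of
the filter method. I'm assuming returning null is what tells the indexer to
drop a document -- I haven't verified that against the source, and the exact
filter() signature depends on the Nutch version, so this is only a sketch:

    // Inside BasicIndexingFilter.filter(...), before the existing
    // field-adding code. Assumption: returning null makes the indexer
    // skip this document entirely; its outlinks were already extracted
    // at parse time, so crawling the links is unaffected.
    String indexFlag = parse.getData().get("index");
    if ("no".equalsIgnoreCase(indexFlag)) {
        return null;   // skip: don't add this page to the index
    }
    // ...otherwise fall through and return doc as before...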
Thanks for the pointers!
Ben
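
PS - on the HtmlParser side, this is the kind of change I had in mind: read
a <meta name="index" content="no"> tag from the page and copy it into the
parse metadata so the filter above can see it. The names metaTags and
metadata below are just stand-ins for whatever HtmlParser actually uses
internally, so treat this as pseudocode:

    // Somewhere in HtmlParser after the <meta> tags have been collected
    // and before the ParseData/Parse object is built. "metaTags" and
    // "metadata" are placeholder names, not the real fields.
    String indexValue = metaTags.getProperty("index");   // e.g. "no" on my seed pages
    if (indexValue != null) {
        metadata.put("index", indexValue);   // surfaces later as parse.getData().get("index")
    }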
On 4/3/06, TDLN <[EMAIL PROTECTED]> wrote:
>
> It depends on whether you control the seed pages. If you do, you could tag
> them with index="no" and skip them during indexing. You would have to change
> HtmlParser and BasicIndexingFilter.
>
> Rgrds, Thomas
>
> On 4/4/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > I've gone through the documentation and tried searching the mailing list
> > archives. I bet this has come up before, but I just couldn't find it, so
> > if someone could point me to a past discussion, that would be great.
> >
> > What I want to do is crawl certain HTML files for their links without
> > actually indexing those files. I ask because I have several seed pages
> > that are not meant for human consumption, so I never want them to show
> > up in search results.
> >
> > How can this be accomplished?
> >
> > Thanks in advance,
> >
> > Ben