[Nutch-general] Re: Crawling a file but not indexing it

TDLN Mon, 03 Apr 2006 22:24:24 -0700

It depends on whether you control the seed pages. If you do, you could tag
them with index="no" and skip them during indexing. The pages would still be
parsed for links. To do this you would have to change HtmlParser and
BasicIndexingFilter.
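
As a rough illustration of the idea (not actual Nutch code), here is a
standalone sketch of the two steps: the parse step always collects outlinks
but records whether the page carries the marker, and the indexing step
consults that flag. Everything here is an assumption made for the example:
the class and method names are hypothetical, the marker is assumed to be an
index="no" attribute on the <html> tag, and regular expressions stand in for
a real HTML parser. In Nutch itself the equivalent logic would have to go
into HtmlParser and BasicIndexingFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch. None of these names are Nutch APIs.
class SeedPageSketch {

    // Assumed marker: an index="no" attribute on the <html> tag.
    private static final Pattern NO_INDEX = Pattern.compile(
        "<html\\b[^>]*\\bindex\\s*=\\s*\"no\"", Pattern.CASE_INSENSITIVE);

    // Simplified link extraction: double-quoted href values on <a> tags.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Result of the parse step: outlinks plus the "may be indexed" flag.
    static final class ParsedPage {
        final List<String> outlinks;
        final boolean indexable;

        ParsedPage(List<String> outlinks, boolean indexable) {
            this.outlinks = outlinks;
            this.indexable = indexable;
        }
    }

    // Parse step: links are always collected so the crawl can follow them,
    // whether or not the page is marked index="no".
    static ParsedPage parse(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        boolean indexable = !NO_INDEX.matcher(html).find();
        return new ParsedPage(links, indexable);
    }

    // Indexing step: returns true only for pages that should be searchable.
    static boolean shouldIndex(ParsedPage page) {
        return page.indexable;
    }

    public static void main(String[] args) {
        String seed = "<html index=\"no\"><body>"
            + "<a href=\"/a.html\">A</a><a href=\"/b.html\">B</a>"
            + "</body></html>";
        ParsedPage page = parse(seed);
        System.out.println("outlinks: " + page.outlinks);
        System.out.println("index it: " + shouldIndex(page));
    }
}
```

The point of splitting it this way is that the decision not to index never
affects link extraction, so the seed page still feeds the crawl.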


Rgrds, Thomas

On 4/4/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I've gone through the documentation and tried searching the mailing list
> archives.  I bet this has come up before, but I just couldn't find it.
> So, if someone could point me to a past discussion that would be great.
>
> What I want to do is be able to crawl html files for links, but not
> actually index that file.  I ask this because I have several seed pages
> that are not meant for human consumption, so I never want them to show up
> in search results.
>
> How can this be accomplished?
>
> Thanks in advance,
>
> Ben
>
>
