Okay, that sounds good. Two questions:
* If I don't want to index a document, what should I return from
BasicIndexingFilter.filter -- the document I receive, null, or something
else? (There's a rough sketch of what I'm picturing below.)
* What change(s) do I have to make to HtmlParser? It seems like I can use
the Parse object as-is, e.g. parse.getData().get("index") to get the
meta-data value for "index". What am I missing? (See the PS below.)
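For the first question, here's roughly what I'm picturing near the top of
the filter method. I'm assuming returning null is what tells the indexer to
drop a document -- I haven't verified that against the source, and the exact
filter() signature depends on the Nutch version, so this is only a sketch:

    // Inside BasicIndexingFilter.filter(...), before the existing
    // field-adding code. Assumption: returning null makes the indexer
    // skip this document entirely; its outlinks were already extracted
    // at parse time, so crawling the links is unaffected.
    String indexFlag = parse.getData().get("index");
    if ("no".equalsIgnoreCase(indexFlag)) {
        return null;   // skip: don't add this page to the index
    }
    // ...otherwise fall through and return doc as before...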
Thanks for the pointers!
Ben
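
PS - on the HtmlParser side, this is the kind of change I had in mind: read
a <meta name="index" content="no"> tag from the page and copy it into the
parse metadata so the filter above can see it. The names metaTags and
metadata below are just stand-ins for whatever HtmlParser actually uses
internally, so treat this as pseudocode:

    // Somewhere in HtmlParser after the <meta> tags have been collected
    // and before the ParseData/Parse object is built. "metaTags" and
    // "metadata" are placeholder names, not the real fields.
    String indexValue = metaTags.getProperty("index");   // e.g. "no" on my seed pages
    if (indexValue != null) {
        metadata.put("index", indexValue);   // surfaces later as parse.getData().get("index")
    }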
On 4/3/06, TDLN <[EMAIL PROTECTED]> wrote:
>
> It depends on whether you control the seed pages. If you do, you could tag
> them with index="no" and skip them during indexing. You would have to change
> HtmlParser and BasicIndexingFilter.
>
> Rgrds, Thomas
>
> On 4/4/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > I've gone through the documentation and tried searching the mailing list
> > archives. I bet this has come up before, but I just couldn't find it, so
> > if someone could point me to a past discussion, that would be great.
> >
> > What I want to do is crawl certain HTML files for their links without
> > actually indexing those files. I ask because I have several seed pages
> > that are not meant for human consumption, so I never want them to show
> > up in search results.
> >
> > How can this be accomplished?
> >
> > Thanks in advance,
> >
> > Ben