Benjamin,
You could add this to HtmlParser and BasicIndexingFilter, but it is
probably better to create your own plugin, along the lines of the
WritingPluginExample:
1) Add a metatag to your seed pages:
<meta name="indexed" content="no" />
2) Create a ParseFilter that implements HtmlParseFilter and retrieves the
metatag using
Properties generalMetaTags = metaTags.getGeneralTags();
String indexed = generalMetaTags.getProperty("indexed");
3) If the tag is present (indexed != null), add the indexed field to the
metadata using
parse.getData().getMetadata().put("indexed", indexed);
4) Create an IndexingFilter that retrieves the indexed property from
the metadata using
String indexed = parse.getData().get("indexed");
5) Return null if "no".equals(indexed); otherwise return the document
unchanged. Writing the comparison this way avoids a NullPointerException
on pages that do not carry the metatag.
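To make the null handling in the steps above concrete, here is a minimal stand-alone sketch of just the decision logic. It deliberately does not use the Nutch classes: plain java.util.Properties objects stand in for the general meta tags and the parse metadata, and the class name IndexedFlagSketch and the methods copyFlag and filter are hypothetical names made up for illustration. Trimming and lower-casing the tag value is also my own assumption, not something Nutch does for you. In a real plugin the same checks would sit inside your HtmlParseFilter and IndexingFilter implementations.

```java
import java.util.Locale;
import java.util.Properties;

/**
 * Stand-alone sketch of the "indexed" flag logic. Plain java.util types stand
 * in for Nutch's meta tags, parse metadata and document; nothing here is the
 * Nutch API, and all names are hypothetical.
 */
final class IndexedFlagSketch {

    static final String KEY = "indexed";

    private IndexedFlagSketch() {
    }

    /** Steps 2 and 3: copy the meta tag value into the parse metadata, if present. */
    static void copyFlag(Properties generalMetaTags, Properties parseMetadata) {
        String indexed = generalMetaTags.getProperty(KEY);
        // Properties is a Hashtable and rejects null values, so only store
        // the flag when the page actually carries the meta tag.
        if (indexed != null) {
            // Normalising the value is an assumption made for this sketch.
            parseMetadata.setProperty(KEY, indexed.trim().toLowerCase(Locale.ROOT));
        }
    }

    /** Steps 4 and 5: return null to drop the document, otherwise pass it through. */
    static <D> D filter(D doc, Properties parseMetadata) {
        String indexed = parseMetadata.getProperty(KEY);
        // "no".equals(indexed) is false when indexed is null, so pages
        // without the tag are indexed as usual.
        return "no".equals(indexed) ? null : doc;
    }
}
```

With a tag value of "no", copyFlag stores the flag and filter returns null; with no tag at all, the metadata stays untouched and filter hands back the document it was given.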
In your case, implementing a QueryFilter is not necessary, I think.
Does this make sense to you?
Rgrds, Thomas
On 4/4/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> Okay, that sounds good. Two questions:
>
> * If I don't want to index a document, then from BasicIndexingFilter.filter,
> should I just return the document I receive? Or should I return null? Or
> something else?
>
> * What change(s) do I have to make to HtmlParser? It seems like I can use
> the Parser object as-is, e.g. parse.getData().get("index") to get the
> meta-data value for index. What am I missing?
>
> Thanks for the pointers!
>
> Ben
>
>
> On 4/3/06, TDLN <[EMAIL PROTECTED]> wrote:
> >
> > It depends if you control the seed pages or not; if you do, you could tag
> > them index="no"
> > and skip them during indexing. You would have to change HtmlParser and
> > BasicIndexingFilter.
> >
> > Rgrds, Thomas
> >
> > On 4/4/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> > >
> > > Hello,
> > >
> > > I've gone through the documentation and tried searching the mailing list
> > > archives. I bet this has come up before, but I just couldn't find it. So,
> > > if someone could point me to a past discussion that would be great.
> > >
> > > What I want to do is be able to crawl html files for links, but not
> > > actually index that file. I ask this because I have several seed pages
> > > that are not meant for human consumption, so I never want them to show
> > > up in search results.
> > >
> > > How can this be accomplished?
> > >
> > > Thanks in advance,
> > >
> > > Ben
> > >
> > >
> >
> >
>
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general