Makes sense.  I made these changes.  The trouble is that returning a null
document here causes a null pointer exception later.  Also, I tried just
returning the document reference it gave me, but this also caused a null
pointer exception:

Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.nutch.indexer.DeleteDuplicates$2.updateHash(DeleteDuplicates.java:195)
        at org.apache.nutch.indexer.DeleteDuplicates.computeHashes(DeleteDuplicates.java:226)
        at org.apache.nutch.indexer.DeleteDuplicates.deleteUrlDuplicates(DeleteDuplicates.java:189)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:155)

Presumably this is because the URL is never set.  Perhaps setting the URL to
the empty string would be sufficient (see the sketch below)?  But I wanted
to post anyway to see if there is a better way to exclude a document from
being indexed at this stage.  Any ideas?
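
For reference, by "set the URL to the empty string" I mean something like
this in my filter (untested; whether the stored "url" field is what
DeleteDuplicates hashes on is just my guess from the stack trace, and
Field.UnIndexed is Lucene's stored-but-not-indexed field factory):

// Instead of returning null, hand back a stub document that at least
// carries an empty "url" field.
// Assumption: DeleteDuplicates reads the stored "url" field when hashing.
Document stub = new Document();
stub.add(Field.UnIndexed("url", ""));
return stub;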

On 4/4/06, TDLN <[EMAIL PROTECTED]> wrote:
>
> Benjamin,
>
> You could add this to HtmlParser and BasicIndexingFilter, but maybe it
> is best to create your own plugin, along the lines of the
> WritingPluginExample:
>
> 1) add a metatag to your seed pages
> <meta name="indexed" content="no" />
>
> 2) create a ParseFilter that implements HtmlParseFilter and retrieve
> the metatag using
>
> Properties generalMetaTags = metaTags.getGeneralTags();
> String indexed = generalMetaTags.getProperty("indexed");
>
> 3) add the indexed field to the metadata using
>
> parse.getData().getMetadata().put("indexed", indexed);
>
> 4) create an IndexingFilter that retrieves the indexed property from
> the metadata using
>
> String indexed = parse.getData().get("indexed");
>
> 5) return null if "no".equals(indexed); written that way round, a page
> without the tag (where indexed is null) does not throw an NPE
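>
> Putting 2) through 5) together, a rough, untested sketch (the class
> names are mine, imports and the plugin.xml wiring from the
> WritingPluginExample are omitted, and the filter() parameter lists are
> from memory of the 0.7-era interfaces, so check them against your
> Nutch version):
>
> public class NoIndexParseFilter implements HtmlParseFilter {
>   public Parse filter(Content content, Parse parse,
>                       HTMLMetaTags metaTags, DocumentFragment doc) {
>     // Copy the <meta name="indexed" .../> value into the parse
>     // metadata so it is still available at indexing time.
>     Properties generalMetaTags = metaTags.getGeneralTags();
>     String indexed = generalMetaTags.getProperty("indexed");
>     if (indexed != null) {
>       parse.getData().getMetadata().put("indexed", indexed);
>     }
>     return parse;
>   }
> }
>
> public class NoIndexIndexingFilter implements IndexingFilter {
>   public Document filter(Document doc, Parse parse, FetcherOutput fo)
>       throws IndexingException {
>     // Drop any document whose page declared
>     // <meta name="indexed" content="no" />.
>     String indexed = parse.getData().get("indexed");
>     if ("no".equals(indexed)) {  // null-safe: most pages have no tag
>       return null;               // skip this document
>     }
>     return doc;
>   }
> }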
>
> In your case, implementing a QueryFilter is not necessary, I think.
>
> Does this make sense to you?
>
> Rgrds, Thomas
>
>
> On 4/4/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> > Okay, that sounds good.  Two questions:
> >
> > * If I don't want to index a document, then from
> > BasicIndexingFilter.filter, should I just return the document I
> > receive?  Or should I return null?  Or something else?
> >
> > * What change(s) do I have to make to HtmlParser?  It seems like I
> > can use the Parse object as-is, e.g. parse.getData().get("index") to
> > get the meta-data value for index.  What am I missing?
> >
> > Thanks for the pointers!
> >
> > Ben
> >
> >
> > On 4/3/06, TDLN <[EMAIL PROTECTED]> wrote:
> > >
> > > It depends on whether you control the seed pages or not; if you
> > > do, you could tag them index="no" and skip them during indexing.
> > > You would have to change HtmlParser and BasicIndexingFilter.
> > >
> > > Rgrds, Thomas
> > >
> > > On 4/4/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I've gone through the documentation and tried searching the
> > > > mailing list archives.  I bet this has come up before, but I
> > > > just couldn't find it.  So, if someone could point me to a past
> > > > discussion, that would be great.
> > > >
> > > > What I want to do is crawl HTML files for links without actually
> > > > indexing those files.  I ask because I have several seed pages
> > > > that are not meant for human consumption, so I never want them
> > > > to show up in search results.
> > > >
> > > > How can this be accomplished?
> > > >
> > > > Thanks in advance,
> > > >
> > > > Ben
> > > >
> > > >
> > >
> > >
> >
> >
>
