The problem with keeping everything is that when I do a recrawl, I don't
want to process all the links. The number of relevant links might be
negligible compared to the number of irrelevant links.
On 10/4/06, Jim Wilson <[EMAIL PROTECTED]> wrote:
It seems like you can probably keep everything in the Links and
Pages/Content - as long as it doesn't end up in the Index, right? If so,
you may only need to modify the indexer and leave the rest alone.
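For example, the gate itself could be as simple as this (just a sketch:
the RelevanceGate name and the keyword list are made up, and you'd plug
the check into Nutch's indexing-filter extension point, whose exact
signature depends on the Nutch version you're on):

import java.util.Arrays;
import java.util.List;

/** Made-up example of a relevance gate applied at indexing time. */
public class RelevanceGate {

    // Stand-in for whatever relevance test you already have.
    private static final List<String> TOPIC_TERMS =
            Arrays.asList("lucene", "nutch", "crawler");

    /** Return true if the page text looks on-topic and should be indexed. */
    public boolean shouldIndex(String pageText) {
        String text = pageText.toLowerCase();
        for (String term : TOPIC_TERMS) {
            if (text.contains(term)) {
                return true;
            }
        }
        return false;
    }
}

The crawl and link databases keep everything either way; only pages that
pass shouldIndex() would ever reach the Lucene index.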
(I may not understand the problem though, and I'm certainly not a Nutch
expert - just trying to help :)
-- Jim
On 10/4/06, Apache Lucene <[EMAIL PROTECTED]> wrote:
>
> On 10/3/06, Jim Wilson <[EMAIL PROTECTED]> wrote:
> >
> > PageRank is a Trademark of Google, and a source of great revenue for
> > them - you'll have to call it something else. :(
>
>
> I was referring to the link score Nutch assigns.
>
> > Determining whether a page is relevant to a topic (with any degree of
> > accuracy) is a harder problem than it may appear - though your opening
> > post says to assume you have a way to do it. For example, suppose the
> > word "buffalo" appears on a page. Does this mean the animal, the NFL
> > team, the city, or the spicy sauce?
>
>
> My case is fairly simple. The case you are mentioning is definitely a
> harder problem.
>
> > Another concern is the assumption that documents linked-to by relevant
> > documents are themselves at all relevant. Take Wikipedia for example -
> > there are lots of links on every page that have nothing to do with the
> > article (such as Main Page, Community Portal, Privacy Policy, etc.). If
> > N is any more than 1 or 2, you'll probably be swamped with non-relevant
> > pages.
>
>
> The reason I wanted this architecture was to get as many links as
> possible. The link database would have the non-relevant links; however,
> the final index would not. The non-relevant links will be used only to
> get links to more relevant pages (if any). At the end of the first crawl
> the database will have the following:
>
> Links: Relevant links + Small number of non-relevant links
> Pages/Content: Relevant pages only
> Index: Relevant pages only
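>
> In code, the bookkeeping would look roughly like this (a made-up sketch,
> not actual Nutch classes; the hop budget is the N discussed below):
>
> import java.util.*;
>
> /** Toy model of what goes into each store during the first crawl. */
> class CrawlStores {
>     Set<String> linkDb = new HashSet<>();            // relevant + few non-relevant links
>     Map<String, String> pageStore = new HashMap<>(); // content of relevant pages only
>     Set<String> index = new HashSet<>();             // indexed URLs, relevant only
>
>     void record(String url, String text, List<String> outlinks,
>                 boolean relevant, boolean withinHopBudget) {
>         if (relevant) {
>             pageStore.put(url, text); // Pages/Content: relevant pages only
>             index.add(url);           // Index: relevant pages only
>             linkDb.addAll(outlinks);  // keep every link found on a relevant page
>         } else if (withinHopBudget) {
>             linkDb.addAll(outlinks);  // non-relevant page: keep only its links
>         }
>         // otherwise the page and its links are dropped
>     }
> }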
>
> In the subsequent recrawling I would just have to recrawl a small subset,
> as opposed to the entire link set I would have had if I had stored all
> the links. If I don't store the non-relevant links, then I might not be
> able to get the relevant links that might sit on the non-relevant pages.
> I am willing to have false positives in the database, but I keep them low
> by setting N appropriately (see the sketch below for what I mean by N). I
> hope this makes sense. Maybe the term "focused crawling" was wrong. :)
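>
> Treating N as the number of hops the crawl may drift away from a relevant
> page, the bookkeeping behind it is roughly this (again a made-up sketch,
> nothing Nutch-specific):
>
> /** Hypothetical hop bookkeeping behind the N parameter. */
> class HopBudget {
>
>     /** Links found on a relevant page start over at hop 1; links found on
>         a non-relevant page drift one hop further from the relevant set. */
>     static int hopsOfChild(boolean parentRelevant, int parentHops) {
>         return parentRelevant ? 1 : parentHops + 1;
>     }
>
>     /** A link stays in the link database only while within the N-hop
>         budget, which is what keeps the false positives bounded. */
>     static boolean keepLink(int hops, int n) {
>         return hops <= n;
>     }
> }
>
> With N = 1 only pages linked directly from relevant pages get followed;
> N = 2 already lets the crawl wander two links out, which is where the
> Wikipedia concern above kicks in.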
>
>
> > In researching the problem, you might want to check out the Carrot
> > Clustering Engine (http://demo.carrot-search.com/carrot2-webapp/main).
> > It may do what you want OOTB.
>
>
> If I am right, Carrot is useful in clustering the search results and
> helpful in crawling. I will take a fresh look at it for crawling.
>
>