I don't know. But how can I find out whether that is the case, and how can I fix it?
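One way to check whether a robots.txt or a robots META tag is what's blocking things (just a command-line sketch, any HTTP client will do; the Nutch fetcher honors robots.txt by default):

    # Look at the site's robots.txt for Disallow rules that would hit the crawler
    curl -s http://www.yahoo.com/robots.txt

    # Look for a robots META tag (noindex/nofollow) in the page itself
    curl -s http://www.yahoo.com/ | grep -i 'meta.*robots'

    # If I remember the 0.9 tools correctly, you can also dump the fetched
    # segment and inspect the crawl db ("segdump" is just an output dir name)
    bin/nutch readseg -dump crawl/segments/20070416230326 segdump
    bin/nutch readdb crawl/crawldb -stats

If robots.txt disallows the page, the fetcher will skip it; if the page was fetched but produced no outlinks (which is what the log below suggests), the segment dump should show that.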
On 4/19/07, RP <[EMAIL PROTECTED]> wrote:
> I've not looked but do they have a robots.txt file or META tag set that
> may be stopping things..??
>
> rp
>
> Meryl Silverburgh wrote:
> > All,
> >
> > Can you please help me with my problem? I have posted my question a
> > few times, but I still can't solve it. I'd appreciate it if anyone can
> > help me with that.
> >
> > I am trying to set up Nutch 0.9 to crawl www.yahoo.com (my setting
> > works for cnn.com and msn.com, but not yahoo.com).
> > I am using this command: "bin/nutch crawl urls -dir crawl -depth 3".
> >
> > But after the command, no links have been fetched.
> >
> > The only strange thing I see in the Hadoop log is this warning:
> >
> > 2007-04-16 23:22:48,062 WARN regex.RegexURLNormalizer - can't find
> > rules for scope 'outlink', using default
> >
> > Is that something I need to set up before www.yahoo.com can be crawled?
> >
> > Here is the output:
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 3
> > Injector: starting
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070416230326
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls by host, for politeness.
> > Generator: done.
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20070416230326
> > Fetcher: threads: 10
> > fetching http://www.yahoo.com/
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20070416230326]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070416230338
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20070416230326
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20070416230326
> > Indexing [http://www.yahoo.com/] with analyzer
> > [EMAIL PROTECTED] (null)
> > Optimizing index.
> > merging segments _ram_0 (1 docs) into _0 (1 docs)
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Dedup: done
> > merging indexes to: crawl/index
> > Adding crawl/indexes/part-00000
> > done merging
> > crawl finished: crawl
> > CrawlDb topN: starting (topN=25, min=0.0)
> > CrawlDb db: crawl/crawldb
> > CrawlDb topN: collecting topN scores.
> > CrawlDb topN: done
> > Match
> >
> > On 4/18/07, c wanek <[EMAIL PROTECTED]> wrote:
> >> Thanks Raj,
> >>
> >> First of all, here's some info I didn't include in the original
> >> question: I'm using Nutch 0.9, and my attempt to add to my index is
> >> basically a variation of the recrawl script at
> >> http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 :
> >>
> >> inject new urls
> >> fetch loop to "depth":
> >> {
> >>   generate
> >>   fetch
> >>   update db
> >> }
> >> merge segments
> >> invert links
> >> index
> >> dedup
> >> merge indexes
> >>
> >> (I wind up with the merged index in 'index/merge-output' instead of
> >> 'index', but I thought perhaps I could deal with that weirdness once
> >> my index has the stuff I want...)
> >>
> >> Now, perhaps I don't understand the patch you pointed me to, but it
> >> seems that it is only meant to avoid recrawling content that hasn't
> >> changed. It doesn't really have anything to do with avoiding a rebuild
> >> of the entire index if I add a document. Or does it, and I just missed it?
> >>
> >> Does Nutch have the ability to add to an index without a complete
> >> rebuild, or is a complete rebuild required if I add even a single document?
> >>
> >> Furthermore, even if I were to decide that the complete rebuild is
> >> acceptable, Nutch is still discarding my custom fields from all
> >> documents that are not being updated. Why is this happening?
> >>
> >> I appreciate the help; thanks.
> >> -Charlie
> >>
> >> On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:
> >> >
> >> > Hi Charlie:
> >> >
> >> > On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
> >> > > Greetings,
> >> > >
> >> > > Now I'm at the point where I would like to add to my crawl, with
> >> > > a new set of seed urls. Using a variation on the recrawl script
> >> > > on the wiki, I can make this happen, but I am running into what
> >> > > is, for me, a showstopper issue. The custom fields I added to the
> >> > > documents of the first crawl are lost when the documents from the
> >> > > second crawl are added to the index.
> >> >
> >> > Nutch is all about writing once. All operations write once; this is
> >> > how map-reduce works. This is why incremental crawling is difficult.
> >> > But :-)
> >> >
> >> > http://issues.apache.org/jira/browse/NUTCH-61
> >> >
> >> > Like you, many others want this to happen. And to the best of my
> >> > knowledge Andrzej Bialecki will be addressing the issue after the
> >> > 0.9 release ... which is anytime now :-)
> >> >
> >> > So you might give it a go with NUTCH-61, but NOTE it doesn't work
> >> > with the current trunk.
> >> >
> >> > Regards
> >> > raj
> >> >
> >> >
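For reference, the recrawl loop Charlie describes maps onto the 0.9 command-line tools roughly as follows. This is only a sketch pieced together from the wiki script: the directory layout, the topN value, and the exact argument order are assumptions, so check the usage each command prints before relying on it.

    #!/bin/sh
    # Rough sketch of the recrawl sequence described above:
    # inject, then generate/fetch/updatedb to depth, then
    # mergesegs, invertlinks, index, dedup, merge.
    crawl=crawl       # assumed crawl directory from the original crawl run
    depth=3
    topN=1000         # assumed limit; the wiki script makes this configurable

    bin/nutch inject $crawl/crawldb urls                  # inject new seed urls

    i=1
    while [ $i -le $depth ]; do
      bin/nutch generate $crawl/crawldb $crawl/segments -topN $topN
      segment=`ls -d $crawl/segments/* | tail -1`         # newest segment
      bin/nutch fetch $segment
      bin/nutch updatedb $crawl/crawldb $segment          # update db
      i=`expr $i + 1`
    done

    bin/nutch mergesegs $crawl/MERGEDsegments -dir $crawl/segments   # merge segments
    bin/nutch invertlinks $crawl/linkdb -dir $crawl/MERGEDsegments   # invert links
    bin/nutch index $crawl/NEWindexes $crawl/crawldb $crawl/linkdb $crawl/MERGEDsegments/*
    bin/nutch dedup $crawl/NEWindexes                                # dedup
    bin/nutch merge $crawl/index $crawl/NEWindexes                   # merge indexes

The whole-index rebuild at the end is exactly the pain point discussed in the thread: the index and dedup steps reprocess every segment you hand them, and until something like NUTCH-61 is available there does not appear to be a supported way to add documents to an existing index incrementally.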
