I don't know. But how can I find out whether that is the case, and how can I fix it?
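One way to check whether a robots.txt or a robots META tag is what's blocking things (just a command-line sketch, any HTTP client will do; the Nutch fetcher honors robots.txt by default):

    # Look at the site's robots.txt for Disallow rules that would hit the crawler
    curl -s http://www.yahoo.com/robots.txt

    # Look for a robots META tag (noindex/nofollow) in the page itself
    curl -s http://www.yahoo.com/ | grep -i 'meta.*robots'

    # If I remember the 0.9 tools correctly, you can also dump the fetched
    # segment and inspect the crawl db ("segdump" is just an output dir name)
    bin/nutch readseg -dump crawl/segments/20070416230326 segdump
    bin/nutch readdb crawl/crawldb -stats

If robots.txt disallows the page, the fetcher will skip it; if the page was fetched but produced no outlinks (which is what the log below suggests), the segment dump should show that.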
On 4/19/07, RP <[EMAIL PROTECTED]> wrote:
> I've not looked but do they have a robots.txt file or META tag set that
> may be stopping things..??
>
> rp
>
> Meryl Silverburgh wrote:
> > All,
> >
> > Can you please help me with my problem? I have posted my question a
> > few times, but I still can't solve it. I'd appreciate it if anyone can
> > help me with that.
> >
> > I am trying to set up Nutch 0.9 to crawl www.yahoo.com (my setting
> > works for cnn.com and msn.com, but not yahoo.com).
> > I am using this command: "bin/nutch crawl urls -dir crawl -depth 3".
> >
> > But after the command, no links have been fetched.
> >
> > The only strange thing I see in the Hadoop log is this warning:
> >
> > 2007-04-16 23:22:48,062 WARN regex.RegexURLNormalizer - can't find
> > rules for scope 'outlink', using default
> >
> > Is that something I need to set up before www.yahoo.com can be crawled?
> >
> > Here is the output:
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 3
> > Injector: starting
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070416230326
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls by host, for politeness.
> > Generator: done.
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20070416230326
> > Fetcher: threads: 10
> > fetching http://www.yahoo.com/
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20070416230326]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070416230338
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20070416230326
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20070416230326
> > Indexing [http://www.yahoo.com/] with analyzer
> > [EMAIL PROTECTED] (null)
> > Optimizing index.
> > merging segments _ram_0 (1 docs) into _0 (1 docs)
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Dedup: done
> > merging indexes to: crawl/index
> > Adding crawl/indexes/part-00000
> > done merging
> > crawl finished: crawl
> > CrawlDb topN: starting (topN=25, min=0.0)
> > CrawlDb db: crawl/crawldb
> > CrawlDb topN: collecting topN scores.
> > CrawlDb topN: done
> > Match
> >
> > On 4/18/07, c wanek <[EMAIL PROTECTED]> wrote:
> >> Thanks Raj,
> >>
> >> First of all, here's some info I didn't include in the original
> >> question: I'm using Nutch 0.9, and my attempt to add to my index is
> >> basically a variation of the recrawl script at
> >> http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 :
> >>
> >> inject new urls
> >> fetch loop to "depth":
> >> {
> >>   generate
> >>   fetch
> >>   update db
> >> }
> >> merge segments
> >> invert links
> >> index
> >> dedup
> >> merge indexes
> >>
> >> (I wind up with the merged index in 'index/merge-output' instead of
> >> 'index', but I thought perhaps I could deal with that weirdness once
> >> my index has the stuff I want...)
> >>
> >> Now, perhaps I don't understand the patch you pointed me to, but it
> >> seems that it is only meant to avoid recrawling content that hasn't
> >> changed. It doesn't really have anything to do with avoiding a rebuild
> >> of the entire index if I add a document. Or does it, and I just missed it?
> >>
> >> Does Nutch have the ability to add to an index without a complete
> >> rebuild, or is a complete rebuild required if I add even a single document?
> >>
> >> Furthermore, even if I were to decide that the complete rebuild is
> >> acceptable, Nutch is still discarding my custom fields from all
> >> documents that are not being updated. Why is this happening?
> >>
> >> I appreciate the help; thanks.
> >> -Charlie
> >>
> >> On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:
> >> >
> >> > Hi Charlie:
> >> >
> >> > On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
> >> > > Greetings,
> >> > >
> >> > > Now I'm at the point where I would like to add to my crawl, with
> >> > > a new set of seed urls. Using a variation on the recrawl script
> >> > > on the wiki, I can make this happen, but I am running into what
> >> > > is, for me, a showstopper issue. The custom fields I added to the
> >> > > documents of the first crawl are lost when the documents from the
> >> > > second crawl are added to the index.
> >> >
> >> > Nutch is all about writing once. All operations write once; this is
> >> > how map-reduce works. This is why incremental crawling is difficult.
> >> > But :-)
> >> >
> >> > http://issues.apache.org/jira/browse/NUTCH-61
> >> >
> >> > Like you, many others want this to happen. And to the best of my
> >> > knowledge Andrzej Bialecki will be addressing the issue after the
> >> > 0.9 release ... which is anytime now :-)
> >> >
> >> > So you might give it a go with NUTCH-61, but NOTE it doesn't work
> >> > with the current trunk.
> >> >
> >> > Regards
> >> > raj
> >> >
> >> >
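For reference, the recrawl loop Charlie describes maps onto the 0.9 command-line tools roughly as follows. This is only a sketch pieced together from the wiki script: the directory layout, the topN value, and the exact argument order are assumptions, so check the usage each command prints before relying on it.

    #!/bin/sh
    # Rough sketch of the recrawl sequence described above:
    # inject, then generate/fetch/updatedb to depth, then
    # mergesegs, invertlinks, index, dedup, merge.
    crawl=crawl       # assumed crawl directory from the original crawl run
    depth=3
    topN=1000         # assumed limit; the wiki script makes this configurable

    bin/nutch inject $crawl/crawldb urls                  # inject new seed urls

    i=1
    while [ $i -le $depth ]; do
      bin/nutch generate $crawl/crawldb $crawl/segments -topN $topN
      segment=`ls -d $crawl/segments/* | tail -1`         # newest segment
      bin/nutch fetch $segment
      bin/nutch updatedb $crawl/crawldb $segment          # update db
      i=`expr $i + 1`
    done

    bin/nutch mergesegs $crawl/MERGEDsegments -dir $crawl/segments   # merge segments
    bin/nutch invertlinks $crawl/linkdb -dir $crawl/MERGEDsegments   # invert links
    bin/nutch index $crawl/NEWindexes $crawl/crawldb $crawl/linkdb $crawl/MERGEDsegments/*
    bin/nutch dedup $crawl/NEWindexes                                # dedup
    bin/nutch merge $crawl/index $crawl/NEWindexes                   # merge indexes

The whole-index rebuild at the end is exactly the pain point discussed in the thread: the index and dedup steps reprocess every segment you hand them, and until something like NUTCH-61 is available there does not appear to be a supported way to add documents to an existing index incrementally.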
