Ian, can you please help me with my problem too?
I am trying to set up Nutch 0.9 to crawl www.yahoo.com, using this command:

bin/nutch crawl urls -dir crawl -depth 3

But after the command runs, no links have been fetched beyond the seed page. The only strange thing I see in the Hadoop log is this warning:

2007-04-16 23:22:48,062 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default

Is that something I need to set up before www.yahoo.com can be crawled? Here is the output:

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230326
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070416230326
Fetcher: threads: 10
fetching http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070416230326]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230338
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20070416230326
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070416230326
Indexing [http://www.yahoo.com/] with analyzer [EMAIL PROTECTED] (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl
CrawlDb topN: starting (topN=25, min=0.0)
CrawlDb db: crawl/crawldb
CrawlDb topN: collecting topN scores.
CrawlDb topN: done
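In case it matters, here is roughly what I have in conf/crawl-urlfilter.txt. I started from the stock file and only changed the domain line, so I am quoting the rest from memory and may have details wrong:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|gz|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in yahoo.com
+^http://([a-z0-9]*\.)*yahoo.com/

# skip everything else
-.

Could the -[?*!@=] line be my problem? A lot of the links on Yahoo's front page seem to carry '?' or '=', and if I read the rules right all of those would be rejected before the second generate, which would fit the "Generator: 0 records selected for fetching" line. (My guess is that the RegexURLNormalizer warning itself is harmless, since it seems to just mean the normalizer falls back to its default rules when no 'outlink'-specific rule file is configured, but I am not sure about that either.)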
Match

On 4/17/07, Ian Holsman <[EMAIL PROTECTED]> wrote:
> Hi Anita.
>
> I tried crawling autos.aol.com, and I could find pages similar to
> what you're looking at in 3 crawls. (I injected http://autos.aol.com/
> and added autos.aol.com to my regex filter to allow it.)
>
> e.g.:
> fetching http://autos.aol.com/bmw-650-2007:8774-photos
> fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
> fetching http://autos.aol.com/options_trimless?v=8544
> fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
> fetching http://autos.aol.com/bmw-m-2007:8905-overview
> fetching http://autos.aol.com/getaquote?myid=8623
> fetching http://autos.aol.com/options_trimless?v=8226
> fetching http://autos.aol.com/options_trimless?v=7803
> fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
> fetching http://autos.aol.com/bmw-x3-2007:8770-specs
> fetching http://autos.aol.com/saturn-vue-2007:8371-overview
> fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
> fetching http://autos.aol.com/options_trimless?v=8394
> fetching http://autos.aol.com/jaguar-listings:JA---
> fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
> fetching http://autos.aol.com/bmw-x5-2007:8817-overview
> fetching http://autos.aol.com/audi-a4-2007:8622-specs
> fetching http://autos.aol.com/options_trimless?v=8416
> fetching http://autos.aol.com/getaquote?myid=8774
>
> The difference is that I am using the latest Nutch (SVN head), and
> am just using a local store, not Hadoop.
>
> What I would do next, if I were you, is check your regex filters to
> make sure you are not blocking things with a colon ':' in them for
> some strange reason, and possibly upgrade to the latest and greatest
> version of Nutch (0.9.1).
>
> regards
> Ian.
>
> On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:
>
> > Hi,
> >
> > I am a new Nutch user, and am using Nutch 0.8.1 with Hadoop. The
> > domain I am trying to crawl is http://autos.aol.com, and I am
> > crawling to a depth of 10.
> > There are certain pages that Nutch could not fetch. An example
> > would be http://autos.aol.com/acura-rl-2006:8060-review.
> >
> > The referring URL for this page is
> > http://autos.aol.com/acura-rl-2007:8060-review, which was in the
> > fetch list.
> >
> > When I did a mini crawl pointing directly to
> > http://autos.aol.com/acura-rl-2007:8060-review, the page
> > http://autos.aol.com/acura-rl-2006:8060-review did get fetched.
> >
> > Does anyone have any ideas on why I am seeing this behavior?
> >
> > Thanks
> > Anita Bidari (X55746)
>
> Ian Holsman
> [EMAIL PROTECTED]
> http://parent-chatter.com -- what do parents know?
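P.S. For anyone else chasing Ian's colon theory: below is a quick standalone probe I sketched to see which rule in a crawl-urlfilter.txt-style list accepts or rejects a given URL. It is only an approximation of the real filter; I am assuming (from my reading of RegexURLFilter) first-match-wins semantics with unanchored matching, and the three rules plus the aol.com accept line are illustrative, not anyone's actual config.

import java.util.regex.Pattern;

public class FilterProbe {
    public static void main(String[] args) {
        // Illustrative rules in crawl-urlfilter.txt syntax: a sign prefix, then a regex.
        String[] rules = {
            "-^(file|ftp|mailto):",
            "-[?*!@=]",
            "+^http://([a-z0-9]*\\.)*aol.com/"
        };
        String[] urls = {
            "http://autos.aol.com/acura-rl-2006:8060-review",
            "http://autos.aol.com/getaquote?myid=8774"
        };
        for (String url : urls) {
            String verdict = "no rule matched (dropped)";
            for (String rule : rules) {
                // First matching rule wins: '+' keeps the URL, '-' drops it.
                if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                    verdict = (rule.charAt(0) == '+' ? "accepted" : "rejected")
                            + " by " + rule;
                    break;
                }
            }
            System.out.println(url + " -> " + verdict);
        }
    }
}

Run against a plain JDK it should report the first URL as accepted (the colon is harmless there) and the second as rejected by -[?*!@=], which is the kind of thing I would want to rule out before blaming colons.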
