Hi,

with nutch-2008-06-26_04-01-58 I'm trying to index a few pages from the Microsoft Support Knowledge Base. I put the URLs into a file called 'urlall' inside my seed directory; it looks like this:
http://support.microsoft.com/kb/317507/en-us
http://support.microsoft.com/kb/295115/en-us
http://support.microsoft.com/kb/295117/en-us
http://support.microsoft.com/kb/840701/en-us
http://support.microsoft.com/kb/924611/en-us
http://support.microsoft.com/kb/158509/en-us
http://support.microsoft.com/kb/259258/en-us
http://support.microsoft.com/kb/287070/en-us

I want to index those 8 pages only, so I run the following command to crawl them:

bin/nutch crawl /Users/dominik/Documents/MastersThesis/nutch/urls -dir /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl -depth 1 -topN 100 -threads 100

When the crawl is finished, only 5 of the 8 pages are indexed. Can you tell me why, or what I need to change so that all pages from 'urlall' get indexed? Thank you!

Here's the output from the crawl command:

crawl started in: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl
rootUrlDir = /Users/dominik/Documents/MastersThesis/nutch/urls
threads = 100
depth = 1
topN = 100
Injector: starting
Injector: crawlDb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/crawldb
Injector: urlDir: /Users/dominik/Documents/MastersThesis/nutch/urls
Injector: Converting injected urls to crawl db entries.
Skipping {\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf330:java.net.MalformedURLException: no protocol: {\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf330
Skipping {\fonttbl\f0\fswiss\fcharset0 Helvetica;}:java.net.MalformedURLException: no protocol: {\fonttbl\f0\fswiss\fcharset0 Helvetica;}
Skipping {\colortbl;\red255\green255\blue255;}:java.net.MalformedURLException: no protocol: {\colortbl;\red255\green255\blue255;}
Skipping \paperw11900\paperh16840\margl1440\margr1440\vieww9000\viewh8400\viewkind0:java.net.MalformedURLException: no protocol: \paperw11900\paperh16840\margl1440\margr1440\vieww9000\viewh8400\viewkind0
Skipping \pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural:java.net.MalformedURLException: no protocol: \pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural
Skipping \f0\fs24 \cf0 \:java.net.MalformedURLException: no protocol: \f0\fs24 \cf0 \
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
Generator: filtering: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
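One thing that worries me in the Injector output: it skips several lines that look like RTF control words ({\rtf1..., {\fonttbl... and so on), as if 'urlall' had been saved as RTF (e.g. by TextEdit) instead of plain text with one URL per line. In case it helps, this is how I would check and fix that on my Mac (just a sketch; the path is from my setup, and I expect textutil to write the converted copy as urlall.txt):

# check what the seed file actually contains
file /Users/dominik/Documents/MastersThesis/nutch/urls/urlall
# if it reports something like "Rich Text Format data" instead of
# "ASCII text", convert it to plain text (macOS ships textutil):
textutil -convert txt /Users/dominik/Documents/MastersThesis/nutch/urls/urlall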
Fetcher: starting
Fetcher: segment: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
Fetcher: threads: 100
fetching http://support.microsoft.com/kb/259258/en-us\
fetching http://support.microsoft.com/kb/317507/en-us\
fetching http://support.microsoft.com/kb/295117/en-us\
fetching http://support.microsoft.com/kb/158509/en-us\
fetching http://support.microsoft.com/kb/295115/en-us\
fetching http://support.microsoft.com/kb/287070/en-us}
fetching http://support.microsoft.com/kb/840701/en-us\
fetching http://support.microsoft.com/kb/924611/en-us\
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/crawldb
CrawlDb update: segments: [/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
LinkDb: done
Indexer: starting
Indexer: linkdb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/linkdb
Indexer: adding segment: file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
IFD [Thread-151]: setInfoStream [EMAIL PROTECTED]
IW 0 [Thread-151]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/private/tmp/hadoop-dominik/mapred/local/index/_-173514222 autoCommit=true [EMAIL PROTECTED] [EMAIL PROTECTED] ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBufferedDeleteTerms=-1 maxFieldLength=10000 index=
Indexing [http://support.microsoft.com/kb/287070/en-us}] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://support.microsoft.com/kb/295115/en-us\] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://support.microsoft.com/kb/317507/en-us\] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://support.microsoft.com/kb/840701/en-us\] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://support.microsoft.com/kb/924611/en-us\] with analyzer [EMAIL PROTECTED] (null)
Optimizing index.
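Also, every fetched URL above ends in a stray "\" or "}" (more RTF leftovers?), and only five "Indexing [...]" lines appear even though all eight URLs were fetched. To see what status each URL ended up with, I believe the crawldb can be inspected with the readdb tool like this (a sketch; paths are from my setup):

cd /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58
# per-status counts (db_fetched, db_gone, ...)
bin/nutch readdb crawl/crawldb -stats
# write one plain-text record per URL into the 'crawldb-dump' directory
bin/nutch readdb crawl/crawldb -dump crawldb-dump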
IW 0 [Thread-151]: optimize: index now
IW 0 [Thread-151]: flush: segment=_0 docStoreSegment=_0 docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=true numDocs=5 numBufDelTerms=0
IW 0 [Thread-151]: index before flush
flush postings as segment _0 numDocs=5
closeDocStore: 2 files to flush to segment _0
oldRAMSize=76608 newFlushedSize=30888 docs/MB=169.738 new/old=40.32%
IW 0 [Thread-151]: checkpoint: wrote segments file "segments_2"
IFD [Thread-151]: now checkpoint "segments_2" [1 segments ; isCommit = true]
IFD [Thread-151]: deleteCommits: now remove commit "segments_1"
IFD [Thread-151]: delete "segments_1"
IW 0 [Thread-151]: LMP: findMerges: 1 segments
IW 0 [Thread-151]: LMP: level -1.0 to 2.6517506: 1 segments
IW 0 [Thread-151]: CMS: now merge
IW 0 [Thread-151]: CMS: index: _0:C5
IW 0 [Thread-151]: CMS: no more merges pending; now return
IW 0 [Thread-151]: CMS: now merge
IW 0 [Thread-151]: CMS: index: _0:C5
IW 0 [Thread-151]: CMS: no more merges pending; now return
IW 0 [Thread-151]: now flush at close
IW 0 [Thread-151]: flush: segment=null docStoreSegment=null docStoreOffset=0 flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 numBufDelTerms=0
IW 0 [Thread-151]: index before flush _0:C5
IW 0 [Thread-151]: at close: _0:C5
Indexer: done
Dedup: starting
Dedup: adding indexes in: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/indexes
Dedup: done
merging indexes to: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/index
Adding file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/indexes/part-00000
done merging
crawl finished: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl
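For what it's worth, the Lucene output above (numDocs=5, final segment _0:C5) also says only five documents ended up in the index. If someone wants to double-check which pages are actually searchable, I think a one-off query can be run from the command line like this (a sketch; it assumes 'searcher.dir' in nutch-site.xml points at the crawl directory, or that the command is started from a directory containing a 'crawl' subdirectory):

cd /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58
# run a query against the new index; each hit should be printed with its URL
bin/nutch org.apache.nutch.searcher.NutchBean microsoft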
