Hi, I run the crawl this way:

./bin/nutch crawl urls -dir crawl -depth 3 -topN 500

My urls file:

http://localhost/test/

My crawl-urlfilter:

+^http://([a-z0-9]*\.)*localhost/

My nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|xml|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>

<property>
  <name>http.agent.name</name>
  <value>C:\cygwin\home\nutch-0.8\crawl</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents http.agent.description http.agent.url http.agent.email http.agent.version and set their values appropriately.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in the User-Agent header. It appears in parenthesis after the agent name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming.</description>
</property>

I have 340 documents (XML, PDF, DOC), but the crawl only takes 46 of them.
What is the problem? Thanks.
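
For reference, here is a minimal Python sketch for sanity-checking the crawl-urlfilter include rule quoted above against sample URLs. It assumes the rule is applied as a regex search over the full URL with the leading '+' stripped. The helper name is_accepted and every sample URL other than http://localhost/test/ are hypothetical. It only exercises the one rule shown, not any other rules that may be in the filter file.

```python
import re

# Include rule from the crawl-urlfilter above, without the leading '+'
# (assumption: '+' only marks the line as an include rule).
INCLUDE_RULE = r"^http://([a-z0-9]*\.)*localhost/"


def is_accepted(url: str) -> bool:
    """Return True if the URL matches the include rule (assumed search semantics)."""
    return re.search(INCLUDE_RULE, url) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs, apart from the seed URL taken from the urls file.
    samples = [
        "http://localhost/test/",
        "http://localhost/test/report.pdf",
        "http://localhost:8080/test/",
        "https://localhost/test/",
        "http://example.com/test/",
    ]
    for url in samples:
        print(("accept" if is_accepted(url) else "reject"), url)
```

Note that under these assumptions the rule requires a '/' directly after "localhost", so a URL with an explicit port or an https scheme would not match it.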
