Hi,

I run the crawl this way:

./bin/nutch crawl urls -dir crawl -depth 3 -topN 500
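(In case it is relevant: to count how many pages actually ended up in the crawldb, I believe Nutch's readdb tool can be used, assuming the crawl directory from the command above:

./bin/nutch readdb crawl/crawldb -stats

which should print the total number of URLs and their status counts.)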

My urls file:

http://localhost/test/


My crawl-urlfilter:

+^http://([a-z0-9]*\.)*localhost/
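(As I understand this rule, a URL like http://localhost/test/doc.pdf or http://www.localhost/test/ is accepted, while a URL on any other host matches no rule and is dropped. These example URLs are only for illustration; they are not in my seed list.)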


My nutch-site.xml:


<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|xml|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
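
(My understanding is that the value above is a single regular expression matched against plugin directory names, so the parse-(text|xml|html|js|pdf) part enables parse-text, parse-xml, parse-html, parse-js and parse-pdf.)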
<property>
  <name>http.agent.name</name>
  <value>C:\cygwin\home\nutch-0.8\crawl</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>



I have 340 documents (XML, PDF, DOC) on that site, but the crawl only fetches 46 of them.

What is the problem?

Thanks