I have PDF documents around 120 MB in size.

Nutch shows me this message:

 Parser can't handle incomplete pdf file.

Why?

In my nutch-default file I already have:

file.content.limit = -1

indexer.max.tokens = 2147483647

What configuration do I need to change?
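One possible cause, assuming the PDFs are fetched over HTTP (the quoted configuration below uses protocol-http): the relevant truncation limit is then http.content.limit, which defaults to 65536 bytes, not file.content.limit. A truncated download would explain the "incomplete pdf file" parser error. A sketch of the override in nutch-site.xml:

```xml
<!-- Sketch for nutch-site.xml: assumes the PDFs are fetched via
     protocol-http, so http.content.limit (default 65536 bytes)
     applies rather than file.content.limit. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>No limit on the length of downloaded content.</description>
</property>
```

Note that overrides belong in nutch-site.xml; nutch-default.xml is normally left untouched.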

thanks

payo wrote:
> 
> Hi,
> 
> I run the crawl this way:
> 
> ./bin/nutch crawl urls -dir crawl -depth 3 -topN 500
> 
> My urls file:
> 
> http://localhost/test/
> 
> 
> My crawl-urlfilter:
> 
> +^http://([a-z0-9]*\.)*localhost/
> 
> 
> My nutch-site.xml:
> 
> 
> <property> 
>   <name>plugin.includes</name> 
>  
> <value>protocol-http|urlfilter-regex|parse-(text|xml|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)</value>
>  
>   <description>Regular expression naming plugin directory names to 
>   include.  Any plugin not matching this expression is excluded. 
>   In any case you need at least include the nutch-extensionpoints plugin.
> By 
>   default Nutch includes crawling just HTML and plain text via HTTP, 
>   and basic indexing and search plugins. 
>   </description> 
> </property>
> <property>
>   <name>http.agent.name</name>
>   <value>C:\cygwin\home\nutch-0.8\crawl</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
>   please set this to a single word uniquely related to your organization.
> 
>   NOTE: You should also check other related properties:
> 
>       http.robots.agents
>       http.agent.description
>       http.agent.url
>       http.agent.email
>       http.agent.version
> 
>   and set their values appropriately.
> 
>   </description>
> </property>
> 
> <property>
>   <name>http.agent.description</name>
>   <value></value>
>   <description>Further description of our bot- this text is used in
>   the User-Agent header.  It appears in parenthesis after the agent name.
>   </description>
> </property>
> 
> <property>
>   <name>http.agent.url</name>
>   <value></value>
>   <description>A URL to advertise in the User-Agent header.  This will 
>    appear in parenthesis after the agent name. Custom dictates that this
>    should be a URL of a page explaining the purpose and behavior of this
>    crawler.
>   </description>
> </property>
> 
> <property>
>   <name>http.agent.email</name>
>   <value></value>
>   <description>An email address to advertise in the HTTP 'From' request
>    header and User-Agent header. A good practice is to mangle this
>    address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
> 
> 
> 
> I have 340 documents (XML, PDF, DOC), but the crawl picks up only 46 of them.
> 
> What is the problem?
> 
> Thanks
> 

-- 
View this message in context: 
http://www.nabble.com/run-the-crawl-tf4799849.html#a13737011
Sent from the Nutch - User mailing list archive at Nabble.com.
