I have PDF documents of about 120 MB each, and Nutch shows me this message:

Parser can't handle incomplete pdf file.

Why does this happen? In my nutch-default file I have:

file.content.limit = -1
indexer.max.tokens = 2147483647

What configuration do I have to change?

Thanks


payo wrote:
> 
> Hi,
> 
> I run the crawl this way:
> 
> ./bin/nutch crawl urls -dir crawl -depth 3 -topN 500
> 
> My urls file:
> 
> http://localhost/test/
> 
> My crawl-urlfilter:
> 
> +^http://([a-z0-9]*\.)*localhost/
> 
> My nutch-site.xml:
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|xml|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> 
> <property>
>   <name>http.agent.name</name>
>   <value>C:\cygwin\home\nutch-0.8\crawl</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your organization.
> 
>   NOTE: You should also check other related properties:
> 
>   http.robots.agents
>   http.agent.description
>   http.agent.url
>   http.agent.email
>   http.agent.version
> 
>   and set their values appropriately.
>   </description>
> </property>
> 
> <property>
>   <name>http.agent.description</name>
>   <value></value>
>   <description>Further description of our bot- this text is used in
>   the User-Agent header. It appears in parenthesis after the agent name.
>   </description>
> </property>
> 
> <property>
>   <name>http.agent.url</name>
>   <value></value>
>   <description>A URL to advertise in the User-Agent header. This will
>   appear in parenthesis after the agent name. Custom dictates that this
>   should be a URL of a page explaining the purpose and behavior of this
>   crawler.
>   </description>
> </property>
> 
> <property>
>   <name>http.agent.email</name>
>   <value></value>
>   <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header. A good practice is to mangle this
>   address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
> 
> I have 340 documents (XML, PDF, DOC), but the crawl only takes 46 of them.
> 
> What is the problem?
> 
> Thanks
> 

-- 
View this message in context: http://www.nabble.com/run-the-crawl-tf4799849.html#a13737011
Sent from the Nutch - User mailing list archive at Nabble.com.
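
A possible direction, offered as an assumption and not as a confirmed fix: the crawl above fetches through protocol-http, and for that protocol the download size is capped by http.content.limit (65536 bytes by default in nutch-default.xml), while file.content.limit applies only to protocol-file. A download cut off at that limit would be consistent with the "incomplete pdf file" message. A minimal sketch of an override that could go in nutch-site.xml:

```xml
<!-- Sketch only: assumes the PDFs are being truncated by the HTTP fetch limit. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Maximum length in bytes of content downloaded over HTTP.
  A negative value disables truncation.
  </description>
</property>
```

With truncation disabled, a 120 MB file is held in memory while it is fetched and parsed, so the JVM heap size may also need to be raised.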
