Re: Could anyone teach me how to index the title or content of a PDF?

Frank Huang Sat, 02 Sep 2006 23:11:08 -0700

Thanks for your help.

I am crawling over HTTP and have set http.content.limit as follows in
nutch-default.xml:
<property>
  <name>http.content.limit</name>
  <value>16777216</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>


But it still shows the same error:
fetch okay,but can`t parse http://(omit...).pdf " reason:failed
<omit..>content
truncated at 70709 bytes.Parse can`t handle incomplete pdf file.

What did I do wrong? Thanks.
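
For reference, here is a minimal sketch of the same setting written as a site-level override. Nutch reads conf/nutch-site.xml after nutch-default.xml, so a value placed there takes precedence over the default. The choice of -1 is only an illustration taken from the property's own description (a negative value means no truncation); whether this resolves the truncation reported above is an assumption, not something confirmed in this thread.

```xml
<!-- Sketch only: place inside the existing root element of conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- Negative value: no truncation, per the property description -->
  <value>-1</value>
</property>
```

If the crawl reads local or intranet files rather than HTTP, the analogous property named in the quoted reply below is file.content.limit.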



Tomi NA wrote:
> 
> On 9/1/06, Frank Huang <[EMAIL PROTECTED]> wrote:
> 
>> But when I execute ./nutch crawl there show some messages like "fetch
>> okay
>> ,but can`t parse http://(omit...).pdf " reason:failed <omit..>content
>> truncated at 70709 bytes.Parse can`t handle incomplete pdf file.
> 
> Haven't had time to go through the complete code (not sure I'd
> understand it, anyway), but this looks like you need to set
> file.content.limit to, say, 16777216. If you're crawling over http
> rather than intranet shares, the property you need to set is
> http.content.limit
> 
> Hope it helps.
> 
> 
> t.n.a.
> 
> 

