[Nutch-general] Re: parse pdf

Clint Cagle Thu, 14 Jul 2005 16:27:41 -0700

got it working...thank you!
On Jul 14, 2005, at 2:25 PM, Rob Pettengill wrote:

in conf/nutch-site.xml

#add the parse-pdf plugin to plugin.includes
<property>
  <name>plugin.includes</name>

  <value>[value lost in the archive; the expression must match parse-pdf]</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
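
Since plugin.includes is a regular expression over plugin directory names, it can help to check an expression against the names you expect before restarting a crawl. The sketch below is only an illustration: the PLUGIN_INCLUDES value is an assumed example (the value from the original message did not survive), the is_included helper is hypothetical, and it assumes the whole directory name has to match the expression.

```python
import re

# Assumed example value, NOT the one from the original message.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(text|html|pdf)"
    r"|index-basic|query-(basic|site|url)"
)


def is_included(plugin_dir, pattern=PLUGIN_INCLUDES):
    """Return True if the whole plugin directory name matches the expression."""
    return re.fullmatch(pattern, plugin_dir) is not None


if __name__ == "__main__":
    # Print which of a few directory names the expression would include.
    for name in ("parse-html", "parse-pdf", "parse-msword"):
        print(name, is_included(name))
```

With this assumed value, parse-pdf is included and parse-msword is not, because only text, html and pdf appear in the parse-(...) group.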

#set http.content.limit large enough for the PDFs you want to parse
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
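
The truncation rule in that description can be sketched in a few lines. This is only an illustration of the stated semantics, not Nutch's own code; apply_content_limit is a hypothetical helper and the 200000-byte document size is made up.

```python
def apply_content_limit(content, limit):
    """Truncate content to `limit` bytes when limit is nonnegative.

    A negative limit means no truncation at all, as the description says.
    """
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


if __name__ == "__main__":
    pdf_bytes = b"x" * 200000  # stand-in for a 200000-byte PDF (made-up size)
    print(len(apply_content_limit(pdf_bytes, 65536)))  # limit from the example above
    print(len(apply_content_limit(pdf_bytes, -1)))     # negative limit: untouched
```

A PDF keeps its cross-reference table and trailer at the end of the file, so a truncated download usually cannot be parsed. That is why the limit has to be at least as large as the biggest PDF you care about.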

# in regex-urlfilter.txt remove pdf from the list of link suffixes that you skip


# skip image and other suffixes we can't yet parse

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bin)$
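
To confirm that the skip rule above no longer catches PDF links, the pattern can be tried against a few URLs. The sketch assumes a leading "-" marks an exclude rule and that the pattern only needs to be found somewhere in the URL; is_skipped and the example.com URLs are hypothetical.

```python
import re

# Skip rule copied from the message above; the leading "-" means "exclude".
SKIP_RULE = (
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|bin)$"
)


def is_skipped(url, rule=SKIP_RULE):
    """Return True if an exclude rule's pattern is found in the URL."""
    sign, pattern = rule[0], rule[1:]
    return sign == "-" and re.search(pattern, url) is not None


if __name__ == "__main__":
    # Print the skip decision for a PDF link and for a still-excluded suffix.
    for url in ("http://example.com/report.pdf", "http://example.com/archive.zip"):
        print(url, is_skipped(url))
```

If the PDF URL is still reported as skipped, pdf is still somewhere in the suffix list.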


;rob

--
Robert C. Pettengill, Ph.D.
   [EMAIL PROTECTED]

Questions about petroleum?
    Goto:   http://AskAboutOil.com/
Need help implementing search?
    Goto:   http://MesaVida.com/


On 2005, Jul 13, at 8:41 PM, Clint Cagle wrote:

Nutch question - how do you enable the pdf plugin to parse pdf



