RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman Tue, 28 Feb 2006 07:08:54 -0800

Thanks for the help. I don't know what happened, but it is working now.
Did any other contributors read what I sent about parsing PDFs?
I don't think Nutch is capable of this, based on the text stripper code
in parse-pdf.
 
http://64.233.179.104/search?q=cache:QOwcLFXNw5oJ:www.irs.gov/pub/irs-pdf/f1040.pdf+irs+1040+pdf&hl=en&gl=us&ct=clnk&cd=1
 
 
It's time to implement some real PDF parsing technology.
Any other takers?

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 28, 2006 9:49 AM
To: [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction

In the attached files, nutch-default.xml contains:
protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)
No parse-pdf is specified...
(The nutch-extensionpoints plugin is not mandatory, since the
plugin.autoactivation property is true: plugins needed by other,
manually activated plugins are activated automatically.)
Are there any plugins in your plugins folder (build/plugins)?
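
To see which plugin directory names a given plugin.includes value selects, the expression can be tried against candidate names. The sketch below is in Python rather than Nutch's own Java. It assumes that the whole directory name has to match the expression. The helper name included_plugins and the candidate list are made up for illustration, and the second value is the one quoted further down this thread.

```python
import re

# plugin.includes value found in the attached nutch-default.xml
ATTACHED = ("protocol-httpclient|urlfilter-regex|parse-(text|html|js)"
            "|index-basic|query-(basic|site|url)")

# plugin.includes value quoted further down the thread
INTENDED = ("nutch-extensionpoints|protocol-http|urlfilter-regex"
            "|parse-(text|html|pdf)|index-basic|query-(basic|site|url)")


def included_plugins(includes, plugin_dirs):
    """Return the directory names that the expression matches in full.

    Assumption: a full match is required, so a name that merely starts
    with a matching alternative does not count.
    """
    pattern = re.compile(includes)
    return [name for name in plugin_dirs if pattern.fullmatch(name)]


# Hypothetical listing of the plugins folder, for illustration only.
candidates = ["parse-text", "parse-html", "parse-js", "parse-pdf",
              "protocol-http", "protocol-httpclient", "index-basic"]

if __name__ == "__main__":
    print(included_plugins(ATTACHED, candidates))
    print(included_plugins(INTENDED, candidates))
```

With the attached value, parse-pdf is filtered out; with the value that lists pdf, it is kept.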

On 2/28/06, Richard Braman <[EMAIL PROTECTED]> wrote:

In nutch-default.xml:

<property>
  <name>plugin.includes</name>

  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

I moved it into nutch-default.xml from nutch-site.xml in an effort to fix the
error, which didn't work. I want this feature to be on by default.
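
Because the same property can live in either file, it helps to see which value ends up in effect. This is a minimal Python sketch, assuming the usual arrangement in which a value in nutch-site.xml overrides the one in nutch-default.xml. The two XML snippets, the root element name and the helper names are hypothetical, and this is not how Nutch itself loads its configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for nutch-default.xml (no parse-pdf listed).
DEFAULT_XML = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>"""

# Hypothetical stand-in for nutch-site.xml (adds pdf to the parse plugins).
SITE_XML = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Collect name/value pairs from every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def effective_properties(default_xml, site_xml):
    """Start from the defaults, then let site values replace them."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


if __name__ == "__main__":
    print(effective_properties(DEFAULT_XML, SITE_XML)["plugin.includes"])
```

Under that assumption, keeping the pdf entry in nutch-site.xml should be enough, and nutch-default.xml can stay as shipped.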

Rich

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 28, 2006 9:27 AM
To: [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction

Could you please send me the value of the plugin.includes property (in
nutch-default.xml and nutch-site.xml)

On 2/28/06, Richard Braman <[EMAIL PROTECTED]> wrote:

Note: a quick search of the archive didn't reveal the code for that.
Please provide it.

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 28, 2006 8:46 AM
To: [email protected]; [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction

Putting the well-formed version of the plugin code you provided generated
the following exception:

Is the nutch-extensionpoints plugin activated?

-- 
http://motrech.free.fr/
http://www.frutch.org/

