Thanks for the help. I don't know what happened, but it is working now.

Did any other contributors read what I sent about parsing PDFs? I don't think Nutch is capable of this, based on the text stripper code in parse-pdf.

http://64.233.179.104/search?q=cache:QOwcLFXNw5oJ:www.irs.gov/pub/irs-pdf/f1040.pdf+irs+1040+pdf&hl=en&gl=us&ct=clnk&cd=1

It's time to implement some real PDF parsing technology. Any other takers?

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 28, 2006 9:49 AM
To: [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction

In the attached files, nutch-default.xml contains:

protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)

No parse-pdf is specified. (nutch-extensionpoints is not mandatory, since the plugin.autoactivation property is true: plugins needed by manually activated plugins are activated automatically.)

Are there any plugins in your plugins folder (build/plugins)?

On 2/28/06, Richard Braman <[EMAIL PROTECTED]> wrote:

In nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

I moved it into nutch-default.xml from nutch-site.xml in an effort to fix the error, which didn't work. I want this feature to be the default.

Rich

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 28, 2006 9:27 AM
To: [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction

Could you please send me the value of the plugin.includes property (in nutch-default.xml and nutch-site.xml)?

On 2/28/06, Richard Braman <[EMAIL PROTECTED]> wrote:

Note: a quick search of the archive didn't reveal the code for that. Please provide it.

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 28, 2006 8:46 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction

Putting the well-formed version of the plugin code you provided generated the following exception:

Is the nutch-extensionpoints plugin activated?

--
http://motrech.free.fr/
http://www.frutch.org/
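
The two plugin.includes values quoted in this thread can be compared with a small sketch. It is written in Python rather than Nutch's own Java, and it assumes that a plugin is included only when its whole directory name matches the expression; that full-match behaviour is an assumption here, not something confirmed in the thread. The helper names `is_included` and `included_plugins` are hypothetical.

```python
import re

# Value found in the attached nutch-default.xml, as quoted by Jérôme.
ATTACHED = ("protocol-httpclient|urlfilter-regex|parse-(text|html|js)"
            "|index-basic|query-(basic|site|url)")

# Value Richard reported setting.
INTENDED = ("nutch-extensionpoints|protocol-http|urlfilter-regex"
            "|parse-(text|html|pdf)|index-basic|query-(basic|site|url)")


def is_included(plugin_dir: str, includes: str) -> bool:
    """Return True if the whole directory name matches the expression.

    Assumption: full-match semantics, so "protocol-http" does not
    match an expression that only lists "protocol-httpclient".
    """
    return re.fullmatch(includes, plugin_dir) is not None


def included_plugins(plugin_dirs, includes):
    """Filter a list of plugin directory names by the expression."""
    return [name for name in plugin_dirs if is_included(name, includes)]


if __name__ == "__main__":
    for name in ("parse-pdf", "parse-html", "parse-js", "protocol-http"):
        print(name, is_included(name, ATTACHED), is_included(name, INTENDED))
```

Under that assumption, parse-pdf matches only the second value, which is consistent with Jérôme's observation that the file actually being read did not specify parse-pdf.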
