Please add 'parse-pdf' to the 'plugin.includes' property in
'conf/nutch-site.xml'. For example:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(pdf|text|html|js|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
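
If you want to sanity-check the expression before re-crawling, here is a minimal sketch in plain Java. It assumes Nutch matches each plugin directory name against the whole 'plugin.includes' expression; the class and method names (PluginIncludesCheck, isIncluded) are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    // Hypothetical helper (not a Nutch API): returns true when the plugin id
    // matches the entire plugin.includes expression.
    // Assumption: Nutch applies a full match, not a substring search.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Same value as in the property above.
        String value = "protocol-httpclient|urlfilter-regex|"
                + "parse-(pdf|text|html|js|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|"
                + "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
                + "urlnormalizer-(pass|regex|basic)";

        // Print whether a few plugin ids would be picked up by the expression.
        for (String id : new String[] {"parse-pdf", "parse-html", "parse-ext"}) {
            System.out.println(id + " included: " + isIncluded(value, id));
        }
    }
}
```

With the value above, 'parse-pdf' and 'parse-html' should be reported as included and 'parse-ext' as excluded.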
Regards,
Susam Pal
On Jan 2, 2008 10:45 PM, Developer Developer <[EMAIL PROTECTED]> wrote:
> Hello Ismael,
>
> Thanks for the quick reply. How do I activate the plugin to parse PDF?
>
> Thanks !
>
>
>
> On Jan 2, 2008 12:09 PM, Ismael <[EMAIL PROTECTED]> wrote:
>
> > Are you sure that you have activated the plugin to parse PDF? If you
> > haven't, please do so. If you have, I think the problem is in how the
> > PDF was built. As far as I know, no PDF text extractor works
> > perfectly. To parse PDF, Nutch uses the PDFBox API. You can
> > download it and extract the text manually with it, so you can check
> > whether the problem lies with Nutch or with PDFBox. You can also
> > download the Nutch source code and look for the PDF plugin in the
> > plugins folder to see how Nutch uses this API.
> >
> > 2008/1/2, Developer Developer <[EMAIL PROTECTED]>:
> > > Hello,
> > >
> > > I need to access the parse text of Nutch documents. I am using NutchBean
> > > to search and then access the parseText from it. Here is the sample code:
> > >
> > >
> > >
> > > Configuration conf = NutchConfiguration.create();
> > > NutchBean nb = new NutchBean(conf);
> > > Hits hits = nb.search(Query.parse("irs", conf), 10);
> > >
> > > //get a sample hit
> > > Hit hit = hits.getHit(8);
> > >
> > > HitDetails hitDetails = nb.getDetails(hit);
> > >
> > > ParseText pText = nb.getParseText(hitDetails);
> > >
> > > System.out.println(pText.getText());
> > >
> > > The System.out.println call prints unreadable characters, as follows:
> > >
> > > obj<</Length 31683/Filter/FlateDecode/Length1 1720/Length2 30704/Length3
> > > 532>>stream
> > > H‰¤U 8Të (R)Ýåt›=*鯶„P†Y†fÆ.'!vb )'´Ì,,fÖŒµÖ¸ÔV*—P
> > > ¥›¨]QJî%'kDE…ЍØÏ"ê(EÚ‡JÍYklgËÉóœszæyþYÿ÷ ÿ»Þï{ßÿ_ºZ|g†•Hê ÛJQ‚ 1Í?õˆ Æ
> > > (c) B &" 1Ù4]]k † DŠò
> > > &sä0° Â
> > >
> > >
> > > Any idea what I am missing? The document is a PDF in English.
> > >
> > > Thanks !
> > >
> >
>