Are you sure you have enabled the PDF parsing plugin (parse-pdf) in
plugin.includes? If not, please enable it. If it is already enabled, I
think the problem is the way this particular PDF was built. As far as
I know, no PDF text extractor works perfectly. Nutch uses the PDFBox
API to parse PDFs. You can download PDFBox and extract the text from
the file manually, to check whether the problem lies with Nutch or
with PDFBox. You can also download the Nutch source code and look at
the PDF plugin in the plugins folder to see how Nutch uses this API.
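
Before digging into PDFBox, it may help to confirm that the stored
parse text really is unparsed PDF data and not just oddly encoded
text. The sketch below is only a rough heuristic using the plain JDK:
it looks for PDF object syntax like the "/Filter/FlateDecode" and
">>stream" markers in the output quoted below, and counts control
characters. The class name ParseTextCheck, the method names, and the
10% threshold are my own assumptions, not part of Nutch or PDFBox.

```java
import java.util.regex.Pattern;

public class ParseTextCheck {

    // Markers of raw PDF object syntax (hypothetical selection, based on
    // the "/Filter/FlateDecode ... >>stream" output in the quoted message).
    private static final Pattern PDF_SYNTAX = Pattern.compile(
        "/Filter\\s*/FlateDecode|\\bendobj\\b|>>\\s*stream\\b|^%PDF-",
        Pattern.MULTILINE);

    /** Fraction of characters that are control characters or U+FFFD. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int bad = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean whitespace = c == '\n' || c == '\r' || c == '\t';
            if ((Character.isISOControl(c) && !whitespace) || c == '\uFFFD') {
                bad++;
            }
        }
        return (double) bad / text.length();
    }

    /** True if the text looks like raw PDF bytes rather than extracted text. */
    static boolean looksLikeRawPdf(String text) {
        // The 0.10 threshold is an arbitrary assumption; tune it as needed.
        return PDF_SYNTAX.matcher(text).find() || suspiciousRatio(text) > 0.10;
    }

    public static void main(String[] args) {
        String raw = "obj<</Length 31683/Filter/FlateDecode/Length1 1720>>stream\n...";
        String ok = "The document is a pdf in english.";
        System.out.println(looksLikeRawPdf(raw));
        System.out.println(looksLikeRawPdf(ok));
    }
}
```

In your code you could pass pText.getText() to looksLikeRawPdf(). If
it returns true, the PDF content was most likely stored without being
run through a PDF parser at all, which points at the plugin
configuration rather than at PDFBox.
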
2008/1/2, Developer Developer <[EMAIL PROTECTED]>:
> Hello,
>
> I need to access the parse text of Nutch documents. I am using NutchBean
> to search and then read the ParseText from it. Here is the sample code:
>
>
>
> Configuration conf = NutchConfiguration.create();
> NutchBean nb = new NutchBean(conf);
> Hits hits = nb.search(Query.parse("irs", conf), 10);
>
> //get a sample hit
> Hit hit = hits.getHit(8);
>
> HitDetails hitDetails = nb.getDetails(hit);
>
> ParseText pText = nb.getParseText(hitDetails);
>
> System.out.println(pText.getText());
>
> The System.out.println call prints unreadable characters, as follows:
>
> obj<</Length 31683/Filter/FlateDecode/Length1 1720/Length2 30704/Length3
> 532>>stream
> H‰¤U 8Të (R)Ýåt›=*鯶„P†Y†fÆ.'!vb )'´Ì,,fÖŒµÖ¸ÔV*—P
> ¥›¨]QJî%'kDE…ЍØÏ"ê(EÚ‡JÍYklgËÉóœszæyþYÿ÷ ÿ»Þï{ßÿ_ºZ|g†•Hê ÛJQ‚ 1Í?õˆ Æ (c)
> B &" 1Ù4]]k † DŠò
> &sä0° Â
>
>
> Any idea what I am missing? The document is a PDF in English.
>
> Thanks !
>