Hello,

I need to access the parse text of Nutch documents. I am using NutchBean to
search and then read the ParseText for a hit. Here is the sample code:



Configuration conf = NutchConfiguration.create();
NutchBean nb = new NutchBean(conf);
Hits hits = nb.search(Query.parse("irs", conf), 10);

//get a sample hit
Hit hit = hits.getHit(8);

HitDetails hitDetails = nb.getDetails(hit);

ParseText pText = nb.getParseText(hitDetails);

System.out.println(pText.getText());

The System.out.println call prints unreadable characters, as follows:

obj<</Length 31683/Filter/FlateDecode/Length1 1720/Length2 30704/Length3
532>>stream
H‰¤U8Të(R)Ýåt›=*鯶„P†Y†fÆ.'!vb)'´Ì,,fÖŒµÖ¸ÔV*—P¥›¨]QJî%'kDE…ЍØÏ"ê(EÚ‡JÍYklgËÉóœszæyþYÿ÷ÿ»Þï{ßÿ_ºZ|g†•HêÛJQ‚1Í?õˆÆ(c)B&"
1Ù4]]k†DŠò
&sä0°Â
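
The fragment above has the shape of a raw PDF stream object (a `<<...>>` dictionary with a `/Filter/FlateDecode` entry followed by the `stream` keyword), not extracted text. A minimal sketch of a check for this is below. `RawPdfCheck` and `looksLikeRawPdf` are hypothetical names used for illustration and are not part of Nutch, and the pattern is only a heuristic.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): heuristically detects raw PDF
// syntax in a string that was expected to hold extracted plain text.
final class RawPdfCheck {
    // A PDF stream object: a dictionary "<< ... >>" that names a /Filter,
    // followed by the keyword "stream".
    private static final Pattern STREAM_DICT =
        Pattern.compile("<<[^>]*/Filter\\s*/\\w+[^>]*>>\\s*stream");

    static boolean looksLikeRawPdf(String text) {
        if (text == null) {
            return false;
        }
        // "%PDF-" is the header that starts every PDF file.
        return text.startsWith("%PDF-") || STREAM_DICT.matcher(text).find();
    }

    public static void main(String[] args) {
        String sample = "obj<</Length 31683/Filter/FlateDecode/Length1 1720>>stream";
        System.out.println(looksLikeRawPdf(sample));
        System.out.println(looksLikeRawPdf("Plain English text from a parsed document."));
    }
}
```

Passing `pText.getText()` to `looksLikeRawPdf` would indicate whether the stored parse text is still raw PDF data.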


Any idea what I am missing? The document is a PDF in English.

Thanks!
