Hello,
I need to access the parsed text of Nutch documents. I am using NutchBean to
search and then read the ParseText from a hit. Here is the sample code:
Configuration conf = NutchConfiguration.create();
NutchBean nb = new NutchBean(conf);
Hits hits = nb.search(Query.parse("irs", conf), 10);
//get a sample hit
Hit hit = hits.getHit(8);
HitDetails hitDetails = nb.getDetails(hit);
ParseText pText = nb.getParseText(hitDetails);
System.out.println(pText.getText());
The System.out.println call prints unreadable characters, as follows:
obj<</Length 31683/Filter/FlateDecode/Length1 1720/Length2 30704/Length3
532>>stream
H‰¤U8Të(R)Ýåt›=*鯶„P†Y†fÆ.'!vb)'´Ì,,fÖŒµÖ¸ÔV*—P¥›¨]QJî%'kDE…ЍØÏ"ê(EÚ‡JÍYklgËÉóœszæyþYÿ÷ÿ»Þï{ßÿ_ºZ|g†•HêÛJQ‚1Í?õˆÆ(c)B&"
1Ù4]]k†DŠò
&sä0°Â
Any idea what I am missing? The document is a PDF in English.
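In case it helps narrow things down: the output above contains PDF object syntax (/Filter, /FlateDecode, stream), so the stored parse text looks like raw PDF content rather than extracted text. Below is a minimal sketch of a check I could run over each hit's text to see which documents are affected. The class RawPdfCheck, the method looksLikeRawPdf, and the marker list are my own hypothetical names and guesses, not part of Nutch, and the check is only a heuristic.

```java
public class RawPdfCheck {
    // Hypothetical marker list: tokens from raw PDF object/stream syntax
    // that should rarely show up in properly extracted English text.
    private static final String[] PDF_MARKERS = {
        "/FlateDecode", "/Filter", ">>stream", "endstream", "endobj"
    };

    // Returns true if the text contains any of the raw PDF markers above.
    static boolean looksLikeRawPdf(String text) {
        if (text == null) {
            return false;
        }
        for (String marker : PDF_MARKERS) {
            if (text.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // First line of the output shown above, shortened.
        String sample = "obj<</Length 31683/Filter/FlateDecode/Length1 1720>>stream";
        System.out.println(looksLikeRawPdf(sample));
        System.out.println(looksLikeRawPdf("The document is a PDF in English."));
    }
}
```

In the sample code above, this would be called as looksLikeRawPdf(pText.getText()) for each hit.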
Thanks!