Re: parse-pdf output is not pretty in cached.jsp

Andrzej Bialecki Tue, 30 Oct 2007 02:55:26 -0800

Mubey N. wrote:

We are using parse-pdf to parse PDF documents. We modified cached.jsp
to display the parsed content instead of the link to the cached
document. We are using bean.getParseText(details) to get the parsed
text from the cached PDF document. But the output that comes on
cached.jsp is not pretty. Parsed text doesn't have any formatting
information. I am just wondering whether there is anything in Nutch
that could display cached PDF documents with proper formatting or at
least some formatting like headers, paragraphs, etc.?

It's possible to do this, but so far the demand has been low enough that nobody implemented it in the public version ... The strategy is as follows:

* apply patches in https://issues.apache.org/jira/browse/NUTCH-466, which provide support for multiple parts in a segment.

* implement a PDF parser based on pdftohtml.exe (using parse-ext plugin?), which gives quite decent quality HTML output. Store this output in an "html" segment part.

* modify Cached.java to retrieve HTML content from the "html" part of the segment, in the case of a PDF document.
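
To illustrate the second step only, here is a rough sketch of calling pdftohtml as an external process from Java and capturing its HTML output. This is not Nutch code: the class and method names are hypothetical, and the flags (-stdout, -noframes, -i, -q) and the UTF-8 output encoding are assumptions to check against the pdftohtml build you actually install. Storing the result in an "html" segment part depends on the NUTCH-466 patches and is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class PdfToHtmlSketch {

    // Build the command line. The flags are assumptions:
    //   -stdout    write the HTML to standard output
    //   -noframes  produce a single HTML document instead of a frameset
    //   -i         ignore images
    //   -q         suppress informational messages
    static List<String> buildCommand(String executable, Path pdf) {
        return List.of(executable, "-stdout", "-noframes", "-i", "-q", pdf.toString());
    }

    // Run the converter and return whatever it printed on stdout.
    static String convert(String executable, Path pdf)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(executable, pdf));
        // Discard stderr so a chatty converter cannot block on a full pipe.
        pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = pb.start();

        byte[] output;
        try (InputStream in = process.getInputStream()) {
            output = in.readAllBytes();
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("converter exited with code " + exitCode);
        }
        // Assumes the converter emits UTF-8.
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

With something like this, PdfToHtmlSketch.convert("pdftohtml", Path.of("doc.pdf")) would return the HTML as a string, which a parser plugin could then store next to the plain parse text. It needs Java 9 or later for readAllBytes and Redirect.DISCARD.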




--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
