Mubey N. wrote:
We are using parse-pdf to parse PDF documents. We modified cached.jsp
to display the parsed content instead of the link to the cached
document. We are using bean.getParseText(details) to get the parsed
text from the cached PDF document. But the output shown on
cached.jsp is not pretty: the parsed text has no formatting
information. I am wondering whether there is anything in Nutch
that could display cached PDF documents with proper formatting, or at
least some formatting such as headers, paragraphs, etc.?

It's possible to do this, but so far the demand has been low enough that nobody has implemented it in the public version. The strategy is as follows:

* apply the patches in https://issues.apache.org/jira/browse/NUTCH-466, which add support for multiple parts in a segment.

* implement a PDF parser based on pdftohtml.exe (perhaps using the parse-ext plugin?), which produces HTML output of quite decent quality. Store this output in an "html" segment part.

* modify Cached.java to retrieve the HTML content from the "html" part of the segment when the document is a PDF.
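
As a rough illustration of the second step only, here is a minimal Java sketch of calling pdftohtml as an external process and capturing its HTML output as a string. It is not Nutch code and does not use the parse-ext plugin or the NUTCH-466 patches. The class name PdfToHtmlSketch and both method names are hypothetical, the flags (-stdout, -noframes, -i, -q) are assumed from the poppler build of pdftohtml and may differ in other builds, and the output is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Nutch: wraps an external pdftohtml call.
class PdfToHtmlSketch {

    // Builds the command line for one PDF file.
    // Assumed flags (poppler's pdftohtml): -stdout writes HTML to standard
    // output, -noframes produces a single document, -i skips images,
    // -q suppresses messages.
    static List<String> buildCommand(String executable, String pdfPath) {
        return List.of(executable, "-stdout", "-noframes", "-i", "-q", pdfPath);
    }

    // Runs the command and returns whatever it printed on standard output.
    // Standard error is discarded so the child cannot block on a full pipe.
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();

        byte[] output;
        try (InputStream in = process.getInputStream()) {
            output = in.readAllBytes();
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("converter exited with code " + exitCode);
        }
        // Assumption: the converter emits UTF-8.
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

A caller would do something like PdfToHtmlSketch.run(PdfToHtmlSketch.buildCommand("pdftohtml", "doc.pdf")) and then store the returned string in the "html" segment part. How that part is written and later read back in Cached.java depends on the NUTCH-466 patches and is not shown here.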



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
