Mubey N. wrote:
We are using parse-pdf to parse PDF documents. We modified cached.jsp to display the parsed content instead of the link to the cached document. We are using bean.getParseText(details) to get the parsed text from the cached PDF document. But the output that comes on cached.jsp is not pretty. Parsed text doesn't have any formatting information. I am just wondering whether there is anything in Nutch that could display cached PDF documents with proper formatting or at least some formatting like headers, paragraphs, etc.?
It's possible to do this, but so far the demand has been low enough that nobody has implemented it in the public version ... The strategy is as follows:
* apply the patches in https://issues.apache.org/jira/browse/NUTCH-466, which provide support for multiple parts in a segment.
* implement a PDF parser based on pdftohtml.exe (perhaps using the parse-ext plugin?), which produces HTML output of quite decent quality. Store this output in an "html" segment part.
* modify Cached.java to retrieve the HTML content from the "html" part of the segment when the document is a PDF.
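
As a rough illustration of the second and third steps, here is a minimal Java sketch (Java 9 or later) of the two pieces that do not depend on Nutch internals: running pdftohtml and capturing its HTML, and deciding which segment part to read for a given content type. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the "content" fallback part name is a guess, and the -stdout, -noframes and -i flags are those of the Poppler build of pdftohtml, so check them against your own binary. It does not use the parse-ext plugin or the NUTCH-466 API; wiring the result into a segment part is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class; not part of Nutch.
final class PdfToHtmlSketch {

    // Picks the segment part the cached view would read from.
    // "html" follows the strategy above; "content" is an assumed default.
    static String segmentPartFor(String contentType) {
        return "application/pdf".equalsIgnoreCase(contentType) ? "html" : "content";
    }

    // Builds the command line. Flags assumed from Poppler's pdftohtml:
    // -stdout writes HTML to standard output, -noframes emits a single
    // document, -i skips image extraction.
    static List<String> buildCommand(String executable, Path pdf) {
        return List.of(executable, "-stdout", "-noframes", "-i", pdf.toString());
    }

    // Runs the converter and returns its standard output as a string.
    // UTF-8 output is an assumption; adjust if your build emits another encoding.
    static String convert(String executable, Path pdf)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(executable, pdf))
                // Discard stderr so a chatty converter cannot block on a full pipe.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        byte[] html;
        try (InputStream out = process.getInputStream()) {
            html = out.readAllBytes();
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftohtml exited with code " + exitCode);
        }
        return new String(html, StandardCharsets.UTF_8);
    }
}
```

A parser plugin would call something like convert("pdftohtml", path) at parse time and store the returned string in the "html" part, while Cached.java would use segmentPartFor(contentType) to decide where to read from.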
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
