[ https://issues.apache.org/jira/browse/JSPWIKI-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376633#comment-16376633 ]
Ulf Dittmer commented on JSPWIKI-469:
-------------------------------------

Rather than use individual libraries, I think a better approach would be to use a general solution like [Apache Tika|http://tika.apache.org/] (which uses only Apache-compatible licenses, and covers a lot more file formats than the libraries listed above). I've used it to add document indexing to a Lucene index, and it's not a big task.

While I'm not in a position to contribute a patch (I don't have time to dig into the JSPWiki code base), I'd be happy to work with someone who knows their way around the code.

Caveat: Tika comes with a LOT of dependent libraries, including POI and PDFBox - its jar file is about 55MB.

> Enhance LuceneSearchProvider for other Attachments
> ---------------------------------------------------
>
>                 Key: JSPWIKI-469
>                 URL: https://issues.apache.org/jira/browse/JSPWIKI-469
>             Project: JSPWiki
>          Issue Type: Improvement
>            Reporter: NicolaFischer
>            Assignee: Florian Holeczek
>            Priority: Minor
>             Fix For: FutureVersion
>
>         Attachments: patch.txt
>
>
> LuceneProvider should index more file types than only plain text. This is one
> attempt to index PDF files.
> Required jars:
> * [Apache POI|http://ftp.tpnet.pl/vol/d1/apache/poi/release/bin] (not tested
> with 3.0.1 final)
> * [PDFBox|http://www.pdfbox.org]
> * [FontBox|http://www.fontbox.org]
> * [OpenDocumentTextInputStream|http://books.evc-cit.info/odf_utils/index.html]
> Patch attached for 2.8.1
> Maybe we should check how to index more documents.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
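
To illustrate the "one general solution" idea from the comment, here is a minimal, dependency-free sketch of how a search provider might route every attachment through a single text-extraction hook instead of branching per file format. Everything in it is an assumption for illustration: the class name `AttachmentIndexSketch`, the `TextExtractor` interface, and the field names `name` and `contents` are hypothetical and are not taken from the JSPWiki code base or the attached patch. A Tika-backed extractor would presumably sit behind the same hook; the stand-in below only handles UTF-8 plain text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class AttachmentIndexSketch {

    /** Hypothetical hook: converts any attachment stream into plain text. */
    interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    /**
     * Stand-in extractor that only understands UTF-8 plain text.
     * A Tika-backed implementation could instead delegate to
     * org.apache.tika.Tika#parseToString(InputStream) and translate
     * TikaException into IOException.
     */
    static final TextExtractor PLAIN_TEXT =
            in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);

    /**
     * Builds the fields a search provider would hand to its index.
     * The field names "name" and "contents" are illustrative only.
     */
    static Map<String, String> toIndexFields(String attachmentName,
                                             InputStream in,
                                             TextExtractor extractor) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", attachmentName);
        fields.put("contents", extractor.extract(in));
        return fields;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "hello wiki".getBytes(StandardCharsets.UTF_8));
        // Print the fields that would be indexed for this attachment.
        System.out.println(toIndexFields("notes.txt", in, PLAIN_TEXT));
    }
}
```

With this shape, supporting PDF, Office, or OpenDocument attachments would mean swapping the extractor rather than adding a new code path per format, which is the trade-off against the roughly 55MB dependency footprint mentioned in the caveat.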