[ 
https://issues.apache.org/jira/browse/JSPWIKI-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376633#comment-16376633
 ] 

Ulf Dittmer commented on JSPWIKI-469:
-------------------------------------

Rather than use individual libraries, I think a better approach would be to use 
a general solution like [Apache Tika|http://tika.apache.org/] (which uses only 
Apache-compatible licenses, and covers a lot more file formats than the 
libraries listed above). I've used it to add document indexing to a Lucene 
index, and it's not a big task. While I'm not in a position to contribute a 
patch (don't have time to dig into the JSPWiki code base), I'd be happy to work 
with someone who knows their way around the code.
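
As a rough illustration of the "one general extractor" idea, here is a minimal Java sketch. It is not JSPWiki code: the names AttachmentText, TextExtractor, PLAIN_TEXT and indexableText are hypothetical, and the block uses only the JDK so it stays self-contained. With Tika on the classpath, the extractor could presumably be backed by the Tika facade (new Tika().parseToString(InputStream)), as noted in the comment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JSPWiki or Tika.
final class AttachmentText {

    /** Single seam through which every attachment type would be converted to text. */
    @FunctionalInterface
    interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    // Assumption: with tika on the classpath, a general-purpose extractor could be
    //   Tika tika = new Tika();
    //   TextExtractor viaTika = in -> {
    //       try { return tika.parseToString(in); }
    //       catch (TikaException e) { throw new IOException(e); }
    //   };
    // The plain-text extractor below only stands in for it (needs Java 9+).
    static final TextExtractor PLAIN_TEXT =
            in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);

    /**
     * Returns the text to feed into the search index, or an empty string
     * when extraction fails, so one bad attachment does not stop indexing.
     */
    static String indexableText(InputStream attachment, TextExtractor extractor) {
        try {
            return extractor.extract(attachment);
        } catch (IOException e) {
            return "";
        }
    }

    private AttachmentText() {
    }
}
```

The point of the seam is that the search provider would call indexableText once per attachment and never need to know which file format it is handling; swapping the extractor is the only change needed to cover more formats.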

Caveat: Tika comes with a LOT of dependent libraries, including POI and PDFBox 
- its jar file is about 55MB.

> Enhance LuceneSearchProvider for other Attachments 
> ---------------------------------------------------
>
>                 Key: JSPWIKI-469
>                 URL: https://issues.apache.org/jira/browse/JSPWIKI-469
>             Project: JSPWiki
>          Issue Type: Improvement
>            Reporter: NicolaFischer
>            Assignee: Florian Holeczek
>            Priority: Minor
>             Fix For: FutureVersion
>
>         Attachments: patch.txt
>
>
> LuceneProvider should index more file types than just plain text. This is one 
> attempt to index PDF files.
> Required jars:
> * [Apache POI|http://ftp.tpnet.pl/vol/d1/apache/poi/release/bin] (not tested 
> with 3.0.1 final)
> * [PDFBox|http://www.pdfbox.org]
> * [FontBox|http://www.fontbox.org] 
> * [OpenDocumentTextInputStream|http://books.evc-cit.info/odf_utils/index.html]
> Patch attached for 2.8.1
> Maybe we should check how to index more documents.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
