[
https://jira.duraspace.org/browse/DS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=24416#comment-24416
]
Richard Rodgers commented on DS-1140:
-------------------------------------
There is already a reimplementation of text extraction in a curation task
suite. It uses the Apache Tika framework, which updates all the extractor
libraries and adds support for dozens of new formats (OpenDocument, etc.).
See the GitHub project:
https://github.com/richardrodgers/ctask/tree/master/mediafilter
and Tika:
http://tika.apache.org/
So I guess that's volunteering...
> Update MSWord Media Filter to use Apache POI (like PPT Filter) and also
> support .docx
> -------------------------------------------------------------------------------------
>
> Key: DS-1140
> URL: https://jira.duraspace.org/browse/DS-1140
> Project: DSpace
> Issue Type: Improvement
> Components: DSpace API
> Reporter: Tim Donohue
> Fix For: 3.0
>
>
> The Microsoft Word Media Filter (org.dspace.app.mediafilter.WordFilter) uses
> obsolete third-party software, specifically the "text-mining" tools
> at: http://code.google.com/p/text-mining/
> However, there are now better options out there, especially Apache POI.
> http://poi.apache.org/text-extraction.html
> Apache POI also has the benefit of being able to extract text from docx, xls,
> xlsx and even Publisher and Visio files.
> We may even be able to create a single "MSFilter" which can just extract doc,
> docx, ppt, pptx, xls, xlsx, etc. all using POI.
> Any volunteers to implement? Looks like we should be able to implement it
> similar to the current PPT Filter
> (org.dspace.app.mediafilter.PowerPointFilter) which already uses POI. See
> also DS-714.
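
For the single "MSFilter" idea quoted above, one thing such a filter would
need is a way to tell the legacy binary formats (doc, ppt, xls) apart from the
newer XML-based ones (docx, pptx, xlsx) before picking an extractor. A minimal
sketch of that step, assuming we only look at the leading bytes of the
bitstream: the class and method names are hypothetical, nothing here is DSpace,
POI, or Tika API, and the actual text extraction is left out entirely.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of DSpace, POI, or Tika. */
class OfficeContainerSniffer {

    /** Container families a combined filter would have to tell apart. */
    enum Container { OLE2, ZIP_PACKAGE, UNKNOWN }

    // First 8 bytes of an OLE2 compound file (legacy .doc, .ppt, .xls).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // First 4 bytes of a ZIP local file header. The .docx, .pptx and .xlsx
    // formats are ZIP packages, but so is any other ZIP file, so a match
    // here is only a hint, not proof of an Office document.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a file by the first bytes of its content. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP_PACKAGE;
        }
        return Container.UNKNOWN;
    }

    // True when data is at least as long as prefix and begins with it.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHeader = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(zipHeader));
    }
}
```

A real filter would hand the stream to the matching extractor after this
check, and a content-detecting library can do the detection itself, so treat
this only as an illustration of why one filter can cover both families.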
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://jira.duraspace.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira