Hi Sue,

DSpace uses a variety of third-party open source tools to perform 
text-extraction (filtering) of documents.

As of DSpace 1.8.x, here's what we use:

For PDFs, there are two options:
--------------------------------
   1) By default, DSpace uses PDFBox 1.6.0 
(http://pdfbox.apache.org/userguide/text_extraction.html). The PDFBox 
site doesn't say whether it works with PDF/A, but I just did a test 
upload of a basic PDF/A document to my local repository, and it filtered 
fine.

   OR

   2) If you are on Linux, you can instead choose to use the XPDF command 
line tools (http://www.foolabs.com/xpdf/). It's also unclear whether these 
tools support PDF/A, but I'd guess that they do (as they are kept up to date).
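
If you want to check a particular PDF/A file against the XPDF tools 
outside of DSpace, a rough sketch like the one below may help. It is NOT 
how DSpace's own XPDF filter is implemented; it simply shells out to the 
pdftotext command and assumes that command is installed and on your PATH. 
The function names are made up for illustration.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the argument list for a pdftotext run that prints to stdout."""
    # "-enc UTF-8" selects the output encoding; the trailing "-" tells
    # pdftotext to write the text to stdout instead of a .txt file.
    return [executable, "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Return the extracted text, or None if pdftotext is not installed."""
    executable = shutil.which("pdftotext")
    if executable is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, executable),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

If extract_text() returns readable text for your PDF/A file, the XPDF 
tools can most likely handle it in DSpace as well.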


For Word docs:
--------------
   * The rather outdated "Text-mining" tools at:
http://code.google.com/p/text-mining/
   * Unfortunately, it looks like these do NOT support docx.
   * However, it looks like POI (used for PPTs, see below) does work for 
docx. Unfortunately, this is not enabled/built out in DSpace yet, so I 
just created an issue for it at: https://jira.duraspace.org/browse/DS-1140
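
For background on why the older .doc tools don't help with docx: a .docx 
file is a ZIP package of XML parts rather than the binary .doc format. 
The sketch below pulls the text out using only the Python standard 
library. It is not the POI code path that DSpace would use, and it 
assumes the main part lives at the conventional word/document.xml 
location. Both function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the plain text of a .docx file, one line per paragraph.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        # Assumption: the main part is at the conventional location.
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # Text lives in <w:t> elements inside each paragraph's runs.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(paragraph_texts):
    """Build a tiny in-memory package holding only the main document part.

    Enough to exercise docx_to_text, but not a complete .docx file.
    """
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % text for text in paragraph_texts
    )
    xml = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer
```

Calling docx_to_text(make_sample_docx(["Hello", "DSpace"])) gives the two 
paragraphs back on separate lines. Headers, footnotes and nested text 
boxes are ignored or handled only roughly here, which is part of why a 
real library such as POI is the better choice for DSpace.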

For PPT:
--------
   * POI 3.6: http://poi.apache.org/
   * This software supports pptx as well

Hope that helps some.  I've also taken this opportunity to update our 
Documentation so that it has links to all the third-party software we 
use for filtering:
https://wiki.duraspace.org/display/DSDOC18/Transforming+DSpace+Content+%28MediaFilters%29

- Tim

On 3/7/2012 12:52 PM, Thornton, Susan M. (LARC-B702)[LITES] wrote:
> Hello,
>
> Does anyone know if PDF-A documents are currently filterable in DSpace
> (to make them full-text searchable)? If not, are there any plans for
> adding this in a future release of DSpace?
>
> Also, what about .docx, .xlsx, and .pptx?
>
> Thanks in advance,
>
> Sue
>
> Sue Walker-Thornton
>
> Software Developer/Database Administrator
>
> NASA Langley Research Center - LITES Contract
>
> [email protected]
>
> (W) 757-864-2368
>
> (M) 757-506-9903
>
>
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
