See my comments below in red.




Sue Walker-Thornton

(w):  (757) 864-2368

(m):  (757) 506-9903



-----Original Message-----
From: Tim Donohue [mailto:[email protected]]
Sent: Monday, March 12, 2012 5:36 PM
To: Thornton, Susan M. (LARC-B702)[LITES]
Cc: [email protected]; Stewart, Susan H. (LARC-B702)
Subject: Re: [Dspace-tech] Are PDF-A documents filterable in DSpace?



Hi Sue,



DSpace uses a variety of third-party open source tools to perform 
text-extraction (filtering) of documents.



As of DSpace 1.8.x, here's what we use:



For PDFs, there are two options:

--------------------------------

   1) By default, it uses PDFBox 1.6.0 
(http://pdfbox.apache.org/userguide/text_extraction.html). The PDFBox site 
doesn't say whether it works with PDF/A.  But, I just did a test upload of a 
basic PDF/A document to my local repository, and it filtered fine.



   OR



   2) If you are on Linux, you can choose to instead use XPDF command line 
tools (http://www.foolabs.com/xpdf/). It's also unclear if this tool supports 
PDF/A, but I'd guess that it does (as it is kept up to date).

We do use XPDF. I tested filtering a PDF/A document the other day and it 
worked just fine.  We switched from PDFBox to XPDF several years ago because we 
had a lot of problems filtering documents with PDFBox and were never sure why.  
Since the switch, XPDF has successfully filtered ALL of our documents (of a 
"filterable" type) unless they are truly corrupt and need to be replaced.  It 
is also much, much faster than PDFBox.  I highly recommend using XPDF rather 
than PDFBox if you're running Linux or Unix.
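For anyone who wants to check a single PDF (PDF/A or otherwise) outside of DSpace before running the filter, here is a minimal Python sketch that calls the XPDF `pdftotext` command line tool directly. It is not how DSpace invokes XPDF internally. The function names are made up for illustration, and it assumes `pdftotext` is on your PATH.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, binary="pdftotext"):
    """Build the pdftotext argument list (hypothetical helper, not part of DSpace)."""
    # "-enc UTF-8" selects the output encoding; the trailing "-" sends the
    # extracted text to standard output instead of a .txt file.
    return [binary, "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path, binary="pdftotext"):
    """Return the extracted text, or None if the tool is missing or fails."""
    if shutil.which(binary) is None:
        return None  # pdftotext is not installed or not on the PATH
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        capture_output=True,
    )
    if result.returncode != 0:
        return None  # e.g. a corrupt or unreadable PDF
    return result.stdout.decode("utf-8", errors="replace")
```

If `extract_text("sample.pdf")` returns a non-empty string, the document has a text layer that `pdftotext` can read. A `None` result means either the tool was not found or it could not process the file.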





For Word docs:

--------------

   * The rather outdated "Text-mining" tools at: 
http://code.google.com/p/text-mining/

   * Unfortunately it looks like these do NOT support docx

   * But, it looks like POI (used for PPTs, see below) does work for docx. 
Unfortunately, this is not enabled/built out in DSpace yet.  I just created an 
issue for it at: https://jira.duraspace.org/browse/DS-1140

Great!  Can you let us know when it's been successfully implemented?



For PPT:

--------

   * POI 3.6: http://poi.apache.org/

   * This software supports pptx as well

How would I integrate this with DSpace version 1.7.1 to tell DSpace to use POI 
to filter .pptx files?
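As background on what filtering a .pptx involves: the file is a ZIP archive whose slides are XML parts, and the visible text sits in DrawingML `a:t` elements. The Python sketch below pulls that text out using only the standard library. It is not POI and not a DSpace media filter; the function name `pptx_text` is hypothetical, and it skips notes, comments, and embedded objects.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace that holds the text runs inside each slide part
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def pptx_text(path):
    """Return one string of text per slide (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        numbered = []
        for name in archive.namelist():
            match = SLIDE_PART.fullmatch(name)
            if match:
                numbered.append((int(match.group(1)), name))
        slides = []
        # Sorted by the number in the part name, which may differ from
        # the order the slides are shown in the presentation.
        for _, name in sorted(numbered):
            root = ET.fromstring(archive.read(name))
            runs = [node.text or "" for node in root.iter(A_NS + "t")]
            slides.append(" ".join(runs))
    return slides
```

Calling `pptx_text("talk.pptx")` gives a list of strings that could be joined and fed to an indexer. A production filter would more likely rely on a maintained library such as POI.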



Hope that helps some.  I've also taken this opportunity to update our 
Documentation so that it has links to all the third-party software we use for 
filtering:

https://wiki.duraspace.org/display/DSDOC18/Transforming+DSpace+Content+%28MediaFilters%29



- Tim



Thanks Tim!

Sue



On 3/7/2012 12:52 PM, Thornton, Susan M. (LARC-B702)[LITES] wrote:

> Hello,
>
> Does anyone know if PDF-A documents are currently filterable in DSpace
> (to make them full-text searchable)? If not, are there any plans for
> adding this in a future release of DSpace?
>
> Also, what about .docx, .xlsx, and .pptx?
>
> Thanks in advance,
>
> Sue
>
> Sue Walker-Thornton
> Software Developer/Database Administrator
> NASA Langley Research Center - LITES Contract
> [email protected]<mailto:[email protected]>
> (W) 757-864-2368
> (M) 757-506-9903
>


> _______________________________________________
> DSpace-tech mailing list
> [email protected]<mailto:[email protected]>
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
