See my comments below in red.
Sue Walker-Thornton
(w): (757) 864-2368
(m): (757) 506-9903
-----Original Message-----
From: Tim Donohue [mailto:[email protected]]
Sent: Monday, March 12, 2012 5:36 PM
To: Thornton, Susan M. (LARC-B702)[LITES]
Cc: [email protected]; Stewart, Susan H. (LARC-B702)
Subject: Re: [Dspace-tech] Are PDF-A documents filterable in DSpace?
Hi Sue,
DSpace uses a variety of third-party open source tools to perform
text-extraction (filtering) of documents.
As of DSpace 1.8.x, here's what we use:
For PDFs, there are two options:
--------------------------------
1) By default, DSpace uses PDFBox 1.6.0
(http://pdfbox.apache.org/userguide/text_extraction.html). The PDFBox site
doesn't say whether it works with PDF/A. However, I just uploaded a basic
PDF/A document to my local repository as a test, and it filtered fine.
OR
2) If you are on Linux, you can instead choose to use the XPDF command-line
tools (http://www.foolabs.com/xpdf/). It's also unclear whether this tool
supports PDF/A, but I'd guess that it does (as it is kept up to date).
We do use XPDF, and I tested filtering a PDF-A document the other day; it
worked just fine. We switched from PDFBox to XPDF several years ago because we
had a lot of problems filtering documents with PDFBox and weren't sure why.
Ever since our switch, XPDF has successfully filtered ALL of our documents
(that are a "filterable" type) unless they were truly corrupt and needed to be
replaced. It is also much, much faster than PDFBox. I highly recommend using
XPDF rather than PDFBox if you're running Linux or Unix.
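For anyone who wants to check a single PDF/A file outside of DSpace, here is a
minimal Python sketch that calls XPDF's pdftotext directly. It is only an
illustration, not how DSpace itself invokes XPDF. The function names are made
up for this example, and it assumes pdftotext is installed and on the PATH.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the argument list for one pdftotext run (hypothetical helper)."""
    # -q suppresses messages, -enc UTF-8 sets the output encoding, and the
    # trailing "-" tells pdftotext to write the text to stdout.
    return [executable, "-q", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_pdf_text(pdf_path):
    """Return the text pdftotext extracts from pdf_path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install XPDF first")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_pdf_text("sample-pdfa.pdf") (a placeholder file name) and
getting readable text back suggests the file should be filterable with XPDF.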
For Word docs:
--------------
* The rather outdated "Text-mining" tools at:
http://code.google.com/p/text-mining/
* Unfortunately, it looks like these do NOT support docx
* But, it looks like POI (used for PPTs, see below) does work for docx.
Unfortunately, this is not enabled/built out in DSpace yet. I just created an
issue for it at: https://jira.duraspace.org/browse/DS-1140
Great! Can you let us know when it's been successfully implemented?
For PPT:
--------
* POI 3.6: http://poi.apache.org/
* This software supports pptx as well
How would I integrate this with DSpace version 1.7.1 to tell DSpace to use POI
to filter .pptx files?
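As a rough illustration of what filtering a .pptx involves (this is neither
POI nor a DSpace configuration): a .pptx file is a ZIP archive whose slides
live at ppt/slides/slideN.xml, with the visible text in DrawingML <a:t>
elements. The Python sketch below pulls that text out using only the standard
library. The function name is hypothetical, and it ignores notes, comments and
embedded objects.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# DrawingML namespace used by the <a:t> text-run elements in slide XML.
DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_pptx_text(source):
    """Return one string of text per slide, in slide-number order.

    source may be a file path or a binary file-like object.
    """
    texts = []
    with zipfile.ZipFile(source) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        # Sort numerically so slide10 comes after slide2.
        for _, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            runs = [node.text for node in root.iter("{%s}t" % DRAWINGML_NS)
                    if node.text]
            texts.append(" ".join(runs))
    return texts
```

If this returns sensible text for a given file, a POI-based filter should
have something to index as well.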
Hope that helps some. I've also taken this opportunity to update our
documentation so that it links to all the third-party software we use for
filtering:
https://wiki.duraspace.org/display/DSDOC18/Transforming+DSpace+Content+%28MediaFilters%29
- Tim
Thanks Tim!
Sue
On 3/7/2012 12:52 PM, Thornton, Susan M. (LARC-B702)[LITES] wrote:
> Hello,
>
> Does anyone know if PDF-A documents are currently filterable in DSpace
> (to make them full-text searchable)? If not, are there any plans for
> adding this in a future release of DSpace?
>
> Also, what about .docx, .xlsx, and .pptx?
>
> Thanks in advance,
>
> Sue
>
> Sue Walker-Thornton
>
> Software Developer/Database Administrator
>
> NASA Langley Research Center - LITES Contract
>
> [email protected]<mailto:[email protected]>
>
> (W) 757-864-2368
>
> (M) 757-506-9903
>
>
>
> _______________________________________________
> DSpace-tech mailing list
> [email protected]<mailto:[email protected]>
> https://lists.sourceforge.net/lists/listinfo/dspace-tech