Hi Sue,

DSpace uses a variety of third-party open source tools to perform text extraction (filtering) of documents.
As of DSpace 1.8.x, here's what we use:

For PDFs, there are two options:
--------------------------------
1) By default, it uses PDFBox 1.6.0 (http://pdfbox.apache.org/userguide/text_extraction.html). The PDFBox site doesn't say whether it works with PDF/A. But, I just did a test upload of a basic PDF/A document to my local repository, and it filtered fine.

OR

2) If you are on Linux, you can choose to instead use XPDF command line tools (http://www.foolabs.com/xpdf/). It's also unclear if this tool supports PDF/A, but I'd guess that it does (as it is kept up to date).

For Word docs:
--------------
* The rather outdated "Text-mining" tools at: http://code.google.com/p/text-mining/
* Unfortunately it looks like these do NOT support docx
* But, it looks like POI (used for PPTs, see below) does work for docx. Unfortunately, this is not enabled/built out in DSpace yet. I just created an issue for it at: https://jira.duraspace.org/browse/DS-1140

For PPT:
--------
* POI 3.6: http://poi.apache.org/
* This software supports pptx as well

Hope that helps some. I've also taken this opportunity to update our Documentation so that it has links to all the third-party software we use for filtering:
https://wiki.duraspace.org/display/DSDOC18/Transforming+DSpace+Content+%28MediaFilters%29

- Tim

On 3/7/2012 12:52 PM, Thornton, Susan M. (LARC-B702)[LITES] wrote:
> Hello,
>
> Does anyone know if PDF-A documents are currently filterable in DSpace
> (to make them full-text searchable)? If not, are there any plans for
> adding this in a future release of DSpace?
>
> Also, what about .docx, .xlsx, and .pptx?
>
> Thanks in advance,
>
> Sue
>
> Sue Walker-Thornton
> Software Developer/Database Administrator
> NASA Langley Research Center - LITES Contract
> [email protected]
> (W) 757-864-2368
> (M) 757-506-9903
>
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

