> Does anyone have a script that checks all of the previously uploaded
> PDFs, finds ones that are malformed, and reports their URLs/record IDs?

I think it's most appropriate to do this with the MediaFilter mechanism.
The default DSpace (1.5.1) distribution includes the plugin:
org.dspace.app.mediafilter.PDFFilter
which extracts text from a PDF.  To do that, it interprets the PDF
contents with PDFBox, which is your asthmatic canary in the gassy
coal mine that is PDF.  You can count on it to keel over even on some files
that are roughly legal and can be rendered by xpdf and Adobe Acrobat Reader.

Running media-filter will log the Handle of each failed Item in the
DSpace log.  See the manual for more info.  It's a lot easier, and sounder
practice, to leverage the existing media filter infrastructure than to
go digging into the database and assetstore -- that implementation may
change even between minor releases or with configuration changes.
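
Once the filter has run, pulling the failed Handles out of the log can
be scripted.  What follows is only a rough sketch in Python: I am
assuming the failures show up on lines containing both "ERROR" and
"mediafilter", and that Handles look like prefix/suffix (for example
123456789/42).  The function name, the marker strings and the pattern
are all made up for illustration, so check them against what your own
log actually contains.

```python
import re

# Assumption: a Handle is "<numeric prefix>/<numeric suffix>", e.g. 123456789/42.
# The lookarounds skip path fragments such as /assetstore/12/34/56.
HANDLE_RE = re.compile(r"(?<![\w/.])\d+(?:\.\d+)*/\d+(?![\w/])")


def failed_handles(log_lines, markers=("ERROR", "mediafilter")):
    """Return the unique Handles seen on lines that contain every marker.

    The marker strings are hypothetical; change them to match the
    messages your DSpace version really writes.
    """
    seen = []
    for line in log_lines:
        if all(marker in line for marker in markers):
            for handle in HANDLE_RE.findall(line):
                if handle not in seen:
                    seen.append(handle)
    return seen


# Usage (the log path is a placeholder):
#   with open("/path/to/dspace.log", errors="replace") as log:
#       for handle in failed_handles(log):
#           print(handle)
```

This only reads the log, so it stays on the safe side of the advice
above: nothing here touches the database or the assetstore.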

If you want to get more aggressive and precise about validating the PDF,
rather than just ensuring it is probably not corrupt, look into JHOVE
at http://hul.harvard.edu/jhove/ and keep an eye on JHOVE2 at
http://confluence.ucop.edu/display/JHOVE2Info/Home
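
To illustrate the difference between "probably not corrupt" and
"valid", here is a hedged Python sketch that runs JHOVE on one file and
sorts the result into three buckets.  It assumes a "jhove" launcher on
the PATH, the PDF module being named "PDF-hul", and a "Status:" line in
the default text report reading "Well-Formed and valid",
"Well-Formed, but not valid" or "Not well-formed".  That is my
recollection of JHOVE 1.x, not something to take on faith, so verify it
against your installation.  The function names are made up.

```python
import subprocess


def jhove_status(report_text):
    """Return the value of the first 'Status:' line in a report, or None."""
    for line in report_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Status":
            return value.strip()
    return None


def classify(status):
    """Map a JHOVE status string (assumed wording) to a coarse verdict."""
    if status is None:
        return "unknown"
    lowered = status.lower()
    if lowered.startswith("not well-formed"):
        return "corrupt"
    if lowered == "well-formed and valid":
        return "valid"
    if lowered.startswith("well-formed"):
        return "well-formed but invalid"
    return "unknown"


def check_pdf(path, jhove="jhove"):
    """Run JHOVE's PDF module on one file and classify its report.

    Assumption: 'jhove -m PDF-hul <file>' prints a text report on stdout.
    """
    result = subprocess.run(
        [jhove, "-m", "PDF-hul", path], capture_output=True, text=True
    )
    return classify(jhove_status(result.stdout))
```

A file that lands in "well-formed but invalid" is the kind that a
viewer may render happily while a stricter parser still objects to it.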

    -- Larry

> I can see how to write a script that uses the unix command line 'file'
> and 'pdftops' tools to check that every file that looks like a PDF is a
> good and valid PDF. Going from a file on the disk to a database record
> I'm not too sure of.

>
> cheers
> stuart
> --
> Stuart Yeates
> http://www.nzetc.org/       New Zealand Electronic Text Centre
> http://researcharchive.vuw.ac.nz/     Institutional Repository
>
> ------------------------------------------------------------------------------
> Open Source Business Conference (OSBC), March 24-25, 2009, San Francisco, CA
> -OSBC tackles the biggest issue in open source: Open Sourcing the Enterprise
> -Strategies to boost innovation and cut costs with open source participation
> -Receive a $600 discount off the registration fee with the source code: SFAD
> http://p.sf.net/sfu/XcvMzF8H
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

