> Does anyone have a script that checks all of the previously uploaded
> PDFs and finds ones that are malformed and reports their URLs/record IDs?

I think it's most appropriate to do this with the MediaFilter mechanism.
The default DSpace (1.5.1) distribution includes the plugin
org.dspace.app.mediafilter.PDFFilter, which extracts text from a PDF.
To do that, it interprets the PDF contents with PDFBox, which is your
asthmatic canary in the gassy coal mine that is PDF. You can count on it
to keel over even on some files that are roughly legal and can be
rendered by xpdf and Adobe Acrobat Reader.

Running media-filter will log the Handle of each failed Item in the
DSpace log. See the manual for more info.

It's a lot easier, and sounder practice, to leverage the existing media
filter infrastructure than to go digging into the database and
assetstore -- that implementation may change even in minor releases and
with configuration changes.

If you want to get more aggressive and precise about validating the PDF,
rather than just ensuring it is probably not corrupt, look into JHOVE at
http://hul.harvard.edu/jhove/ and keep an eye on JHOVE2 at
http://confluence.ucop.edu/display/JHOVE2Info/Home

-- Larry

> I can see how to write a script that uses the unix command line 'file'
> and 'pdftops' tools to check that every file that looks like a PDF is a
> good and valid PDF. Going from a file on the disk to a database record
> I'm not too sure of.
>
> cheers
> stuart
> --
> Stuart Yeates
> http://www.nzetc.org/ New Zealand Electronic Text Centre
> http://researcharchive.vuw.ac.nz/ Institutional Repository
>
> ------------------------------------------------------------------------------
> Open Source Business Conference (OSBC), March 24-25, 2009, San Francisco, CA
> -OSBC tackles the biggest issue in open source: Open Sourcing the Enterprise
> -Strategies to boost innovation and cut costs with open source participation
> -Receive a $600 discount off the registration fee with the source code: SFAD
> http://p.sf.net/sfu/XcvMzF8H
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
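
For what it's worth, the cheap "does this look like a PDF and is it probably not truncated" test mentioned above can be sketched in a few lines of Python. This is only a rough illustration, not what PDFFilter or JHOVE do: it assumes a PDF carries a "%PDF-" header near the start and a "%%EOF" marker near the end, and the names classify, find_suspects, HEAD_BYTES and TAIL_BYTES are made up for the example. A file that passes may still be unreadable by PDFBox, and the script says nothing about which Item a file belongs to; that mapping is the part better left to the media filter.

```python
import os

# Assumed window sizes (hypothetical): how far from each end of the file
# to look for the PDF header and the end-of-file marker.
HEAD_BYTES = 1024
TAIL_BYTES = 1024


def classify(path):
    """Return 'not-pdf', 'ok', or 'suspect' for a single file."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(HEAD_BYTES)
        fh.seek(max(0, size - TAIL_BYTES))
        tail = fh.read()
    if b"%PDF-" not in head:
        # No PDF header near the start: not a candidate at all.
        return "not-pdf"
    # A header without a trailing %%EOF usually means a truncated upload.
    return "ok" if b"%%EOF" in tail else "suspect"


def find_suspects(root):
    """Walk a directory tree and list files that look like damaged PDFs."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if classify(path) == "suspect":
                suspects.append(path)
    return sorted(suspects)
```

Calling find_suspects on a directory returns the sorted paths of files that have a PDF header but no end-of-file marker. Anything it flags is worth a closer look with pdftops or JHOVE.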

