[Dspace-tech] script to validate all PDFs ?
Does anyone have a script that checks all of the previously uploaded PDFs, finds the ones that are malformed, and reports their URLs/record IDs?

I can see how to write a script that uses the Unix command-line 'file' and 'pdftops' tools to check that every file that looks like a PDF is a good and valid PDF. What I'm not too sure of is how to go from a file on the disk to a database record.

cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/ New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository

--
Open Source Business Conference (OSBC), March 24-25, 2009, San Francisco, CA
-OSBC tackles the biggest issue in open source: Open Sourcing the Enterprise
-Strategies to boost innovation and cut costs with open source participation
-Receive a $600 discount off the registration fee with the source code: SFAD
http://p.sf.net/sfu/XcvMzF8H
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
Re: [Dspace-tech] script to validate all PDFs ?
Hi Stuart,

Example assetstore file:

${dspace.dir}/assetstore/95/80/98/95809816172544348784747013964495251419

The filename itself is stored in bitstream.internal_id in the DSpace database, and the directory names are just the first six digits of the internal ID, taken two at a time.

Here's a SQL query that resolves internal_ids to item_id (aka record ID) and handle (which should tie into the URL):

select item.item_id, handle, bitstream.internal_id
from item, item2bundle, bundle2bitstream, handle, bitstream
where item.item_id = item2bundle.item_id
and item2bundle.bundle_id = bundle2bitstream.bundle_id
and bundle2bitstream.bitstream_id = bitstream.bitstream_id
and handle.resource_id = item.item_id;

I've never looked at writing a script based on this (we are just doing the standard checksum checking at the moment), but it shouldn't be too difficult.

(If you want to cut down on analysing non-PDFs with 'file', you could also use bitstream.bitstream_format_id to build a list of PDFs before running the filesystem-level tools.)

Cheers,
Kim.

--
Kim Shepherd
IRR Technical Specialist
ITS Systems Development
The University of Waikato
DDI +64 7 838 4025

-----Original Message-----
From: stuart yeates [mailto:stuart.yea...@vuw.ac.nz]
Sent: Wednesday, 25 February 2009 9:03 a.m.
To: dspace-tech@lists.sourceforge.net
Subject: [Dspace-tech] script to validate all PDFs ?

Does anyone have a script that checks all of the previously uploaded PDFs and find ones that are malformed and reports their URLs/record IDs? I can see how to write a script that uses the unix command line 'file' and 'pdftops' tools to check that every file that looks like a PDF is a good and valid PDF. Going from a file on the disk to a database record I'm not too sure of.
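For anyone who wants to script Kim's suggestion, here is a minimal Python sketch of the idea, not a tested tool. It maps an internal_id to its assetstore path (first six digits, two per directory level) and reports the rows whose files fail a check. All function names are made up for illustration. The rows are assumed to be (item_id, handle, internal_id) tuples fetched with whatever database driver you already use. The query is Kim's, rewritten with explicit joins; the mimetype filter via bitstreamformatregistry and the resource_type_id = 2 condition (meant to restrict handles to items) are assumptions to verify against your own schema. The header check is a simple stand-in for 'file', and 'pdftops' is assumed to be on the PATH.

```python
import os
import subprocess

# Kim's query with explicit joins. The last two conditions are assumptions:
# check the table/column names and the resource type value in your schema.
PDF_BITSTREAMS_SQL = """
SELECT item.item_id, handle.handle, bitstream.internal_id
FROM item
JOIN item2bundle ON item.item_id = item2bundle.item_id
JOIN bundle2bitstream ON item2bundle.bundle_id = bundle2bitstream.bundle_id
JOIN bitstream ON bundle2bitstream.bitstream_id = bitstream.bitstream_id
JOIN handle ON handle.resource_id = item.item_id
JOIN bitstreamformatregistry fmt
  ON bitstream.bitstream_format_id = fmt.bitstream_format_id
WHERE fmt.mimetype = 'application/pdf'
  AND handle.resource_type_id = 2
"""


def assetstore_path(assetstore_dir, internal_id):
    """Map a bitstream.internal_id to its file under the assetstore."""
    # The directory names are the first six digits, two per level.
    levels = (internal_id[0:2], internal_id[2:4], internal_id[4:6])
    return os.path.join(assetstore_dir, *levels, internal_id)


def looks_like_pdf(path):
    """Cheap check: does the file start with the PDF magic bytes?"""
    with open(path, "rb") as stream:
        return stream.read(5) == b"%PDF-"


def pdftops_accepts(path):
    """Stricter check: True if 'pdftops' converts the file and exits with 0."""
    result = subprocess.run(
        ["pdftops", path, os.devnull],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_bad_pdfs(rows, assetstore_dir, is_valid=pdftops_accepts):
    """Yield (item_id, handle, path) for rows whose file is missing or invalid."""
    for item_id, handle, internal_id in rows:
        path = assetstore_path(assetstore_dir, internal_id)
        if not os.path.isfile(path) or not is_valid(path):
            yield item_id, handle, path
```

Feeding the query results and the assetstore directory to find_bad_pdfs gives the item IDs and handles to report. Passing is_valid=looks_like_pdf swaps in the cheap header check when 'pdftops' is not installed. Note that an exit status of 0 only shows that the tool got through the file, so capturing its error output would be a stricter test.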
Re: [Dspace-tech] script to validate all PDFs ?
> Does anyone have a script that checks all of the previously uploaded PDFs and find ones that are malformed and reports their URLs/record IDs?

I think it's most appropriate to do this with the MediaFilter mechanism. The default DSpace (1.5.1) distribution includes the plugin org.dspace.app.mediafilter.PDFFilter, which extracts text from a PDF. To do that, it interprets the PDF contents with PDFBox, which is your asthmatic canary in the gassy coal mine that is PDF. You can count on it to keel over even on some files that are roughly legal and can be rendered by xpdf and Adobe Acrobat Reader.

Running media-filter will log the Handle of failed Items in the DSpace log. See the manual for more info.

It's a lot easier, and sounder practice, to leverage the existing media filter infrastructure than to go digging into the database and assetstore -- that implementation may change even across minor releases or with configuration changes.

If you want to get more aggressive and precise about validating the PDF, rather than just ensuring it is probably not corrupt, look into JHOVE at http://hul.harvard.edu/jhove/ and keep an eye on JHOVE2 at http://confluence.ucop.edu/display/JHOVE2Info/Home

-- Larry

> I can see how to write a script that uses the unix command line 'file' and 'pdftops' tools to check that every file that looks like a PDF is a good and valid PDF. Going from a file on the disk to a database record I'm not too sure of.