[Dspace-tech] script to validate all PDFs ?

2009-02-24 Thread stuart yeates
Does anyone have a script that checks all of the previously uploaded
PDFs, finds the ones that are malformed, and reports their URLs/record IDs?

I can see how to write a script that uses the Unix command-line 'file'
and 'pdftops' tools to check that every file that looks like a PDF is a
good and valid PDF. What I'm not too sure of is how to get from a file
on disk to a database record.

cheers
stuart
-- 
Stuart Yeates
http://www.nzetc.org/   New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository

--
Open Source Business Conference (OSBC), March 24-25, 2009, San Francisco, CA
-OSBC tackles the biggest issue in open source: Open Sourcing the Enterprise
-Strategies to boost innovation and cut costs with open source participation
-Receive a $600 discount off the registration fee with the source code: SFAD
http://p.sf.net/sfu/XcvMzF8H
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] script to validate all PDFs ?

2009-02-24 Thread Kim Shepherd
Hi Stuart,

Example assetstore file:

${dspace.dir}/assetstore/95/80/98/95809816172544348784747013964495251419

The filename itself is stored in bitstream.internal_id in the DSpace database,
and the three directory names are just the first 6 digits of the internal ID,
taken two at a time.
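
As a rough illustration of that layout, here is a minimal Python sketch that rebuilds the on-disk path from an internal_id. The function name and the assetstore directory argument are made up for the example; only the two-digits-per-directory pattern comes from the path shown above, and a multi-assetstore setup would need extra handling.

```python
import os


def assetstore_path(assetstore_dir, internal_id):
    """Return the path of a bitstream file inside one assetstore directory.

    Assumes the layout from the example above: three directory levels,
    each named after two consecutive characters of the internal ID.
    """
    if len(internal_id) < 6:
        raise ValueError("internal_id too short: %r" % internal_id)
    # First six characters, split into three two-character directory names.
    levels = [internal_id[0:2], internal_id[2:4], internal_id[4:6]]
    return os.path.join(assetstore_dir, *levels, internal_id)


if __name__ == "__main__":
    # "/dspace/assetstore" is a placeholder for ${dspace.dir}/assetstore.
    print(assetstore_path("/dspace/assetstore",
                          "95809816172544348784747013964495251419"))
```

Going the other way (from a file found on disk to a database row) is simpler still: the file's base name is the internal_id to look up.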

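Here's an SQL query that resolves internal_ids to item_id (aka record ID) and
handle (which should tie into the URL):

select item.item_id, handle.handle, bitstream.internal_id
from item, item2bundle, bundle2bitstream, handle, bitstream
where item.item_id = item2bundle.item_id
  and item2bundle.bundle_id = bundle2bitstream.bundle_id
  and bundle2bitstream.bitstream_id = bitstream.bitstream_id
  and handle.resource_id = item.item_id
  and handle.resource_type_id = 2;

(The last condition restricts the join to handles that belong to items, since
collections and communities share the handle table; 2 is the item type
constant in the standard DSpace schema, so check it against your version.)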

I've never looked at writing a script based on this (we are just doing the 
standard checksum checking at the moment) but it shouldn't be too difficult.

(if you want to cut down on analysing non-PDFs with 'file', you could use 
bitstream.bitstream_format_id to build a list of PDFs before running the 
filesystem-level tools, too..)
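
For the filesystem-level check itself, something along these lines might serve as a starting point. It is only a sketch: the function names and the handles in the test data are hypothetical, the "%PDF-" signature test is a strict stand-in for what 'file' does, and the second step assumes a 'pdftops' binary (xpdf or poppler) that returns a non-zero exit status when it cannot process a file. If the tool is not installed, that step is skipped.

```python
import os
import shutil
import subprocess


def looks_like_pdf(path):
    """True if the file starts with the '%PDF-' signature."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def converts_cleanly(path, tool="pdftops"):
    """Run the converter, discarding its output.

    Returns True/False based on the exit status, or None if the tool
    is not on the PATH (so the caller can tell "not checked" apart).
    """
    exe = shutil.which(tool)
    if exe is None:
        return None
    result = subprocess.run([exe, path, os.devnull],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def find_bad_pdfs(candidates, tool="pdftops"):
    """candidates: iterable of (handle, path) pairs, e.g. built from the
    query results. Returns a list of (handle, reason) for suspect files."""
    bad = []
    for handle, path in candidates:
        if not os.path.isfile(path):
            bad.append((handle, "file missing"))
        elif not looks_like_pdf(path):
            bad.append((handle, "no %PDF- signature"))
        elif converts_cleanly(path, tool) is False:
            bad.append((handle, tool + " failed"))
    return bad
```

Note that an exit status of 0 does not guarantee a fully valid PDF; some converters recover from damage and only print warnings, so this catches the badly broken files rather than every malformed one.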

Cheers,

Kim.

--
Kim Shepherd
IRR Technical Specialist
ITS Systems  Development
The University of Waikato
DDI +64 7 838 4025






Re: [Dspace-tech] script to validate all PDFs ?

2009-02-24 Thread Larry Stone
> Does anyone have a script that checks all of the previously uploaded
> PDFs and find ones that are malformed and reports their URLs/record IDs?

I think it's most appropriate to do this with the MediaFilter mechanism.
The default DSpace (1.5.1) distribution includes the plugin
org.dspace.app.mediafilter.PDFFilter, which extracts text from a PDF.
To do that, it interprets the PDF
contents with PDFBox, which is your asthmatic canary in the gassy
coal mine that is PDF.  You can count on it to keel over even on some files
that are roughly legal and can be rendered by xpdf and Adobe Acrobat Reader.

Running media-filter will log the Handle of failed Items in the DSpace
log.  See the manual for more info.  It's a lot easier, and sounder
practice, to leverage the existing media filter infrastructure than to
go digging into the database and assetstore: that implementation may
change even in minor releases or with configuration changes.

If you want to get more aggressive and precise about validating the PDF,
rather than just ensuring it is probably not corrupt, look into JHOVE
at http://hul.harvard.edu/jhove/ and keep an eye on JHOVE2
http://confluence.ucop.edu/display/JHOVE2Info/Home

-- Larry
