Re: [Dspace-tech] searching, PDFs, HTML and XML

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] Fri, 12 Dec 2008 12:12:07 -0800

     Question:  If a .pdf document contains, let's say, 1 page in the
middle of a document that contains an image (a drawing for instance), is
filter-media going to fail on the filtering of this document or will it
just skip the image and continue to filter what it can?


     I have made some modifications to the DSpace 1.4.2 filter-media
process so that if a document cannot be filtered for whatever reason
(unreadable characters, java heap space error, etc), the bitstream_id
for that document gets written to a local table.  Before
MediaFilterManager.java even attempts to filter a document, it checks
that local table to see if that bitstream_id exists in the table.  If it
does, it will not even *attempt* to filter that document and instead
increments a counter and a last-date-skipped column in the local table.
A periodic report is sent to the users, who inspect each listed document
to see whether it failed to OCR correctly, etc.  If appropriate, they
rescan the original document, delete the old document, and upload the new
document into DSpace.  Since the new document has a new bitstream_id,
filter-media will attempt to filter it that night and the process
repeats.
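
     To make the idea concrete, here is a minimal sketch of the skip
logic described above.  It is not the actual change to
MediaFilterManager.java: the class and method names (FilterSkipList,
recordFailure, shouldSkip, filterOrSkip) are hypothetical, and an
in-memory map stands in for the local database table.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the local table of bitstreams that failed filtering. */
class FilterSkipList {

    /** One row of the table: how often and when a bitstream was last skipped. */
    static final class Entry {
        int skipCount;
        LocalDate lastSkipped;
    }

    // Keyed by bitstream_id; assumption: a real version would use a database table.
    private final Map<Integer, Entry> rows = new HashMap<>();

    /** Remember a bitstream whose filtering attempt failed. */
    void recordFailure(int bitstreamId) {
        rows.putIfAbsent(bitstreamId, new Entry());
    }

    /**
     * Returns true if the bitstream is already in the table. As a side effect,
     * increments its skip counter and updates its last-date-skipped value.
     */
    boolean shouldSkip(int bitstreamId, LocalDate today) {
        Entry e = rows.get(bitstreamId);
        if (e == null) {
            return false;
        }
        e.skipCount++;
        e.lastSkipped = today;
        return true;
    }

    /** Number of times the bitstream has been skipped (0 if it is not in the table). */
    int skipCount(int bitstreamId) {
        Entry e = rows.get(bitstreamId);
        return e == null ? 0 : e.skipCount;
    }

    /**
     * Runs the filter unless the bitstream is in the table. Any failure,
     * including a Java heap space error, adds the bitstream to the table.
     * Returns true only if the filter ran and completed.
     */
    boolean filterOrSkip(int bitstreamId, Runnable filter, LocalDate today) {
        if (shouldSkip(bitstreamId, today)) {
            return false; // known-bad document: do not even attempt it
        }
        try {
            filter.run();
            return true;
        } catch (RuntimeException | OutOfMemoryError ex) {
            recordFailure(bitstreamId);
            return false;
        }
    }
}
```

     In this sketch the first failed attempt only records the
bitstream_id; the counter and date are updated on later runs, when the
document is skipped without being filtered.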

     The best thing about this mod is that it can save hours of
processing time, especially with documents where a previous filtering
attempt has resulted in a Java heap space error.  Sometimes the
filtering attempt will actually run for hours before it fails with the
Java heap space error on a document.  Simply adding the bitstream_id for
this document to our local table will eliminate a subsequent filtering
attempt and filter-media runs and completes much faster.

     I would be happy to share this code with anyone who is interested.

     Please let me know if anyone can answer my question about filtering
results with a .pdf document that contains 1 or more unfilterable
images.

Thanks in advance,

Sue Walker-Thornton
ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator
130 Research Drive
Hampton, VA  23666
Office: (757) 224-4074
Fax:    (757) 224-4001
Pager: (757) 988-2547 
Email:  susan.m.thorn...@nasa.gov


-----Original Message-----
From: Shane Beers [mailto:sbe...@gmu.edu] 
Sent: Friday, December 12, 2008 10:31 AM
To: Andrew Marlow
Cc: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] searching, PDFs, HTML and XML

Andrew:
Performing OCR on a PDF document is, as far as I know, the most widely  
used method to search a PDF document. Is there a specific reason you  
do not want the PDFs to be searchable? Even the archival "standard" of  
PDF/A (archival PDF) allows for OCR.

I use the commercial product ABBYY Finereader for a variety of  
solutions. From their web site: "When you are converting documents for  
editing, ABBYY FineReader 9.0 exports the results directly to your  
favorite applications including Microsoft Word, Microsoft Excel,  
Microsoft PowerPoint, and Adobe Acrobat/Reader. In addition,  
recognized text can be saved in a variety of file formats, including  
PDF, PDF/A, HTML, Microsoft Word XML, DOC/DOCX, RTF, XLS/ XLSX, PPT,  
DBF, CSV, TXT, and LIT. "

It looks like this would be able to fit your needs. However, I would  
be of the opinion that just performing OCR would be the most direct  
and stable option.

Additionally, you can upload multiple bitstreams per item in DSpace.
The first page of the ingest process asks if the item contains
multiple files, and you would answer in the affirmative. You can also
edit an individual item's bitstreams as an admin after they are
already in the archive.

Shane Beers
Digital Repository Services Librarian
George Mason University
sbe...@gmu.edu
http://mars.gmu.edu
703-993-3742



On Dec 12, 2008, at 3:44 AM, Andrew Marlow wrote:

> Hello,
>
> Now that I have loaded a few PDFs into my DSpace repo, I am  
> wondering how to enable full text searching. The PDFs happen to be  
> in a form that means they cannot be searched directly. So when I  
> search in DSpace I get no results returned (unless the text also  
> appears in the abstract I entered manually). If I could find a way  
> to convert the PDF to HTML this might do the trick, but if it did, I
> think it would be working for the wrong reasons. According to my
> limited research, the proper way to enable full text search in
> digital libraries is to have the documents in XML form. This raises  
> a few DSpace questions.
>
> I do not actually see anywhere in DSpace where I can upload an XML  
> (assuming I find a way to generate one from the PDF).
>
> I suspect that DSpace expects to be able to perform full text  
> searching using the HTML rather than using XML. This would work,  
> kind of, but with XML I think it works a whole lot better due to the
> metadata in the XML. An XML approach would require some sort of  
> schema. I do not know of any standards in this area.
>
> Have I got it right/wrong? Am I barking up the wrong tree? I think I  
> might need a lesson from a seasoned DSpacer on how full text  
> searching is done when the PDFs are not searchable. Googling I find  
> that other digital libraries, e.g. those not based on DSpace, tend to
> approach the problem in their own way. For example, solutions based  
> on Mark Logic are able to take advantage of a Mark Logic feature  
> where it generates the XML from the PDF when the PDF is uploaded.
>
> -- 
> Regards,
>
> Andrew M.
>
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you.  Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

