Re: [Dspace-tech] searching, PDFs, HTML and XML

Shane Beers Fri, 12 Dec 2008 07:31:52 -0800

Andrew:
Performing OCR on a PDF document is, as far as I know, the most widely
used method of making a PDF document searchable. Is there a specific
reason you do not want the PDFs to be searchable? Even the archival
"standard" of PDF/A (archival PDF) allows for OCR.


I use the commercial product ABBYY FineReader for a variety of
solutions. From their web site: "When you are converting documents for  
editing, ABBYY FineReader 9.0 exports the results directly to your  
favorite applications including Microsoft Word, Microsoft Excel,  
Microsoft PowerPoint, and Adobe Acrobat/Reader. In addition,  
recognized text can be saved in a variety of file formats, including  
PDF, PDF/A, HTML, Microsoft Word XML, DOC/DOCX, RTF, XLS/XLSX, PPT,
DBF, CSV, TXT, and LIT."

It looks like this would fit your needs. However, in my opinion,
simply performing OCR would be the most direct and stable option.
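
Before running OCR over a whole collection, it can help to find out
which PDFs actually lack a text layer. The Python sketch below is only
a rough heuristic, not anything DSpace or FineReader provides: it
assumes a PDF that draws text will mention a font resource ("/Font")
somewhere in its raw bytes, so a file with no such marker is probably
image-only. The function name probably_needs_ocr is made up for
illustration, and PDFs that keep their dictionaries in compressed
object streams can hide the marker and be flagged wrongly.

```python
def probably_needs_ocr(pdf_bytes: bytes) -> bool:
    """Guess whether a PDF is image-only (hypothetical helper).

    Heuristic: pages that draw text reference a font resource, so the
    raw file normally contains the name "/Font". This is an assumption,
    not a guarantee: compressed object streams may hide the marker.
    """
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")
    # No font marker anywhere in the file: likely a scan without text.
    return b"/Font" not in pdf_bytes


def needs_ocr(path: str) -> bool:
    """Read a file from disk and apply the heuristic above."""
    with open(path, "rb") as handle:
        return probably_needs_ocr(handle.read())
```

Files flagged this way are candidates for OCR; anything else most
likely already has text that can be extracted and indexed.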

Additionally, you can upload multiple bitstreams per item in DSpace.
The first page of the ingest process asks if the item contains
multiple files, and you would answer in the affirmative. You can also
edit an individual item's bitstreams as an admin after the item is
already in the archive.
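
To picture what "multiple bitstreams per item" means for your XML
question, here is a small Python sketch. It is only an illustration of
the idea, not the DSpace API: the Item and Bitstream classes and the
file names are invented, and the point is simply that one item can
carry the scanned PDF alongside an XML (or HTML) rendition of it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bitstream:
    """One uploaded file (hypothetical model, not a DSpace class)."""
    name: str
    mime_type: str


@dataclass
class Item:
    """An archive item that can hold any number of files."""
    title: str
    bitstreams: List[Bitstream] = field(default_factory=list)

    def add(self, name: str, mime_type: str) -> None:
        self.bitstreams.append(Bitstream(name, mime_type))

    def formats(self) -> List[str]:
        return [b.mime_type for b in self.bitstreams]


# Example: the scanned PDF plus an XML rendition in the same item.
item = Item("Scanned report")
item.add("report.pdf", "application/pdf")
item.add("report.xml", "text/xml")
```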

Shane Beers
Digital Repository Services Librarian
George Mason University
sbe...@gmu.edu
http://mars.gmu.edu
703-993-3742



On Dec 12, 2008, at 3:44 AM, Andrew Marlow wrote:

> Hello,
>
> Now that I have loaded a few PDFs into my DSpace repo, I am  
> wondering how to enable full text searching. The PDFs happen to be  
> in a form that means they cannot be searched directly. So when I  
> search in DSpace I get no results returned (unless the text also  
> appears in the abstract I entered manually). If I could find a way  
> to convert the PDF to HTML this might do the trick but if it did, I
> think it would be working for the wrong reasons. According to my
> limited research, the proper way to enable full text search in
> digital libraries is to have the documents in XML form. This raises  
> a few DSpace questions.
>
> I do not actually see anywhere in DSpace where I can upload an XML  
> (assuming I find a way to generate one from the PDF).
>
> I suspect that DSpace expects to be able to perform full text  
> searching using the HTML rather than using XML. This would work,  
> kind of, but with XML I think it works a whole lot better due to the
> metadata in the XML. An XML approach would require some sort of  
> schema. I do not know of any standards in this area.
>
> Have I got it right/wrong? Am I barking up the wrong tree? I think I  
> might need a lesson from a seasoned DSpacer on how full text  
> searching is done when the PDFs are not searchable. Googling I find  
> that other digital libraries, e.g. those not based on DSpace, tend to
> approach the problem in their own way. For example, solutions based  
> on Mark Logic are able to take advantage of a Mark Logic feature  
> where it generates the XML from the PDF when the PDF is uploaded.
>
> -- 
> Regards,
>
> Andrew M.
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you.  Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech


