Andrew: Performing OCR on a PDF document is, as far as I know, the most widely used way to make a PDF document searchable. Is there a specific reason you do not want the PDFs themselves to be searchable? Even the archival PDF standard, PDF/A, allows for OCR.
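If it helps to decide which PDFs actually need OCR, here is a minimal sketch of a rough check, using only the Python standard library. It rests on an assumption rather than a guarantee: a PDF that is nothing but scanned page images usually contains no font resources, so the absence of any `/Font` entry suggests there is no text layer to search. The function name `has_font_resources` is made up for this illustration, and the check can be fooled in both directions (for example by a scanned page with a stamped page number, or by an encrypted file), so treat the result as a hint only.

```python
import re
import zlib

# Matches the body of each PDF stream object (between "stream" and "endstream").
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def has_font_resources(pdf_bytes: bytes) -> bool:
    """Heuristic: return True if the PDF mentions any /Font resource.

    Assumption (not a guarantee): image-only PDFs normally declare no fonts,
    so False suggests the file needs OCR before it can be searched.
    """
    # Fast path: uncompressed objects mention /Font in the raw bytes.
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs may keep their dictionaries inside Flate-compressed streams,
    # so try to inflate each stream and look again.
    for match in _STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (e.g. a JPEG page image); skip it
        if b"/Font" in inflated:
            return True
    return False
```

To try it on a real file, read the file in binary mode and pass the bytes, e.g. `has_font_resources(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder name. Files that come back False are the ones worth running through OCR first.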
I use the commercial product ABBYY FineReader for a variety of solutions. From their web site:

"When you are converting documents for editing, ABBYY FineReader 9.0 exports the results directly to your favorite applications including Microsoft Word, Microsoft Excel, Microsoft PowerPoint, and Adobe Acrobat/Reader. In addition, recognized text can be saved in a variety of file formats, including PDF, PDF/A, HTML, Microsoft Word XML, DOC/DOCX, RTF, XLS/XLSX, PPT, DBF, CSV, TXT, and LIT."

It looks like this would fit your needs. However, in my opinion, just performing OCR would be the most direct and stable option.

Additionally, you can upload multiple bitstreams per item in DSpace. The first page of the ingest process asks whether the item contains multiple files, and you would answer in the affirmative. You can also edit an individual item's bitstreams as an admin after the item is already in the archive.

Shane Beers
Digital Repository Services Librarian
George Mason University
sbe...@gmu.edu
http://mars.gmu.edu
703-993-3742

On Dec 12, 2008, at 3:44 AM, Andrew Marlow wrote:

> Hello,
>
> Now that I have loaded a few PDFs into my DSpace repo, I am
> wondering how to enable full text searching. The PDFs happen to be
> in a form that means they cannot be searched directly. So when I
> search in DSpace I get no results returned (unless the text also
> appears in the abstract I entered manually). If I could find a way
> to convert the PDF to HTML this might do the trick, but if it did, I
> think it would be working for the wrong reasons. According to my
> limited research, the proper way to enable full text search in
> digital libraries is to have the documents in XML form. This raises
> a few DSpace questions.
>
> I do not actually see anywhere in DSpace where I can upload an XML
> (assuming I find a way to generate one from the PDF).
>
> I suspect that DSpace expects to be able to perform full text
> searching using the HTML rather than using XML.
> This would work, kind of, but with XML I think it works a whole lot
> better due to the metadata in the XML. An XML approach would require
> some sort of schema. I do not know of any standards in this area.
>
> Have I got it right/wrong? Am I barking up the wrong tree? I think I
> might need a lesson from a seasoned DSpacer on how full text
> searching is done when the PDFs are not searchable. Googling, I find
> that other digital libraries, e.g. those not based on DSpace, tend to
> approach the problem in their own way. For example, solutions based
> on Mark Logic are able to take advantage of a Mark Logic feature
> where it generates the XML from the PDF when the PDF is uploaded.
>
> --
> Regards,
>
> Andrew M.
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you. Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech