Hi Andrew,

Text extraction in DSpace is done by running filter-media, typically as a cron job. DSpace already bundles PDFBox, so there is no need to build and install it yourself. The extracted text is stored in a separate bundle and is only visible in administrator functions such as editing an item, so ordinary users are not shown it alongside the PDF.
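As a rough illustration of the cron-driven run described above, here is a minimal Python sketch of a wrapper that a cron entry could call. The install path `/dspace`, the helper names, and the use of the `[dspace]/bin/dspace filter-media` launcher with a `-v` (verbose) switch are assumptions; older releases may ship a standalone `[dspace]/bin/filter-media` script instead, so check your own installation before relying on it.

```python
import subprocess


def filter_media_command(dspace_dir, verbose=False):
    """Build the argument list for the filter-media run.

    Assumes the launcher lives at <dspace_dir>/bin/dspace (hypothetical
    layout; adjust to match your installation).
    """
    cmd = [f"{dspace_dir}/bin/dspace", "filter-media"]
    if verbose:
        # -v is assumed to enable verbose output.
        cmd.append("-v")
    return cmd


def run_filter_media(dspace_dir, verbose=False):
    """Run filter-media and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(filter_media_command(dspace_dir, verbose), check=True)


if __name__ == "__main__":
    # Intended to be invoked from cron, e.g. once a night.
    run_filter_media("/dspace", verbose=True)
```

Keeping the command construction separate from the call that executes it makes it easy to print or log the exact command before scheduling it.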
Hope that helps,
Claudia Jürgen

> Hello,
>
> I have just built and installed pdfbox so I could try out its PDF to text
> conversion. It seems quite good and very fast, based on the small simple
> example I gave it (which was a real technical journal). I am very tempted
> to use this to generate text files for all the PDFs that I will be
> uploading into my DSpace. These text files are no good at all for
> rendering but I am hoping they will enable a full text search of the
> articles to be done. What do people think of this approach please?
>
> pdfbox text extraction makes no attempt to create any kind of metadata, it
> really is just like doing a string dump of the PDF. Normally DSpace
> libraries that make more than just a PDF available make an HTML version
> available as well, rather than a text file. The HTML renders reasonably
> closely to the PDF. And of course the HTML is what enables the full text
> search. I don't see how I can get that with what I have, i.e. no HTML, no
> metadata files and PDFs that are not OCR'd. So using pdfbox seems to me
> like the only way to get full text searching.
>
> My "solution" also has another little quirk. When one finds the article
> one is looking for, the item page will have the text file as well as the
> PDF, but should the user select the text file they will be very puzzled as
> to why it is so poor and indeed why, given its visual quality, it is even
> there.
> --
> Regards,
>
> Andrew M.
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

