Thanks, Tony. This gives me some ideas for handling our content.

Alan
On 5/29/09 11:49 AM, "Apuzzo, Tony" <[email protected]> wrote:

We haven't done the exact process described by the OP. What we're doing to
load ~4 million pages of PDF is:

* Use the iText library to split incoming PDFs into separate pages (a code
  sketch follows at the end of this thread).
* Store the split PDF pages on an external web server. (We use WebLogic
  with a REST front end, but a static HTTP server would work too.)
* Deliver the PDF files to MarkLogic CPF to do text conversion for
  full-text search (a loading sketch also follows below).
* Create a top-level asset "DocBook-like" XML file that contains the
  converted text plus URL references to the page-split PDFs (a hypothetical
  shape is sketched below).

Our input PDFs do not have any embedded metadata, so we aren't trying to
extract anything from them, but we could use iText to extract the PDF
properties if we needed to.

The performance is very good with this scheme, and we don't have to worry
about using ML for BLOBs. We don't have a requirement for snippet
highlighting, but we do get to the correct page(s) very easily.

HTH,

Tony Apuzzo
Distinguished Engineer
Flatirons Solutions
http://www.flatironssolutions.com

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Mary Holstege
Sent: Friday, May 29, 2009 9:16 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Experience loading PDFs into ML

On Thu, 28 May 2009 17:18:04 -0700, Alan Darnell <[email protected]> wrote:

> I'm wondering if anyone has experience they could share on loading PDFs
> into ML and indexing them for text retrieval while leaving the PDF in
> the database for users to download.
>
> Do you use the CPF to extract text from the PDF and store that as a new
> text document in ML? If so, how do you link up the PDF and the text
> document - a common URL scheme?
>
> Do you extract XMP-encoded metadata from the PDFs and use that to
> populate properties or create a new XML document associated with the
> PDF?
>
> It would be great to display snippets from the PDF based on the pages
> that match the user query (like Google Book Search does). Is there a
> way to extract text from the PDF that retains its page and position
> information so you can go back to the PDF to generate a snippet image?
>
> Does maintaining the PDFs in the database have a negative impact on
> index sizes or performance?
>
> Thanks in advance,
>
> Alan

The default CPF PDF conversion will create a new XHTML version of the PDF.
If you just want the extracted text for searching and not for rendering,
one of the alternative pipelines just extracts the text of each page and
stores it as a bag of words in a "page" element. Some metadata is
extracted in each case as well. Properties on the documents connect the
source and the conversion products.

//Mary
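To make Tony's page-split step concrete, here is a minimal sketch using the
iText 2.x API (the com.lowagie packages current at the time of this thread).
The file names are placeholders and error handling is elided; treat it as an
illustration, not production code.

    import java.io.FileOutputStream;

    import com.lowagie.text.Document;
    import com.lowagie.text.pdf.PdfCopy;
    import com.lowagie.text.pdf.PdfReader;

    public class PdfPageSplitter {
        public static void main(String[] args) throws Exception {
            // Open the source PDF (placeholder path).
            PdfReader reader = new PdfReader("input.pdf");
            int pageCount = reader.getNumberOfPages();

            // Write each page out as its own single-page PDF,
            // preserving the original page size and rotation.
            for (int i = 1; i <= pageCount; i++) {
                Document document =
                        new Document(reader.getPageSizeWithRotation(i));
                PdfCopy copy = new PdfCopy(document,
                        new FileOutputStream("page-" + i + ".pdf"));
                document.open();
                copy.addPage(copy.getImportedPage(reader, i));
                document.close();
            }

            // If the PDF properties were ever needed, the document-info
            // dictionary (Title, Author, etc.) is available as a Map:
            // java.util.Map info = reader.getInfo();

            reader.close();
        }
    }

Splitting this way keeps each output PDF small, which is part of what makes
serving individual pages from a plain HTTP server practical.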
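The thread doesn't show what the "DocBook-like" asset file looks like; the
element names and host URL below are invented purely to illustrate the idea
of pairing each page's converted text with a URL reference to its page-split
PDF:

    <asset id="doc-0001">
      <title>Example document</title>
      <page number="1"
            pdf="http://pdfhost.example.com/doc-0001/page-1.pdf">
        extracted text of page one ...
      </page>
      <page number="2"
            pdf="http://pdfhost.example.com/doc-0001/page-2.pdf">
        extracted text of page two ...
      </page>
    </asset>

Searching the page text gets you to the matching page element, and the pdf
attribute carries the link out to the external server, which is consistent
with getting "to the correct page(s) very easily" without storing BLOBs in ML.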
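For delivering the PDFs into MarkLogic so a CPF conversion pipeline can
process them, one option is the XCC/J API. A minimal sketch, assuming an
XDBC server is listening at the placeholder URI below and that a conversion
pipeline is attached to the target directory:

    import java.io.File;
    import java.net.URI;

    import com.marklogic.xcc.Content;
    import com.marklogic.xcc.ContentCreateOptions;
    import com.marklogic.xcc.ContentFactory;
    import com.marklogic.xcc.ContentSource;
    import com.marklogic.xcc.ContentSourceFactory;
    import com.marklogic.xcc.Session;

    public class PdfLoader {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; adjust user, password,
            // host, and port for a real XDBC server.
            ContentSource source = ContentSourceFactory.newContentSource(
                    new URI("xcc://user:password@localhost:8004"));
            Session session = source.newSession();
            try {
                // Insert the PDF as a binary document; a CPF pipeline
                // attached to /pdfs/ can then pick it up for conversion.
                ContentCreateOptions options =
                        ContentCreateOptions.newBinaryInstance();
                Content content = ContentFactory.newContent(
                        "/pdfs/doc-0001.pdf", new File("input.pdf"), options);
                session.insertContent(content);
            } finally {
                session.close();
            }
        }
    }

As Mary notes above, the link between the source document and its conversion
products is kept in document properties, so no URL-naming convention is
strictly required to tie them together.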
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
