On May 28, 2009, at 5:18 PM, Alan Darnell wrote:
> I’m wondering if anyone has experience they could share on loading
> PDFs into ML and indexing these for text retrieval while leaving the
> PDF in the database for users to download.
>
> Do you use the CPF to extract text from the PDF and store that as a
> new text document in ML?
With MarkMail we process attachments that are Office documents and
convert them to PDF internally so we can "take their picture" for in-
browser display. After loading each mail message we call a
postproc.xqy module that does this text extraction and various other
things, like message threading.
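
Roughly, that step looks like the following. This is a minimal sketch
rather than the real postproc.xqy; the URIs and element names are
invented, but xdmp:document-filter is the built-in that converts a
binary like a PDF into XHTML you can pull text from:

    xquery version "1.0-ml";

    (: Sketch of a post-load extraction step.  URIs and element
       names are illustrative. :)
    let $pdf-uri := "/attachments/msg-1234/report.pdf"
    let $msg-uri := "/messages/msg-1234.xml"

    (: Convert the PDF binary to XHTML so we can get at its text. :)
    let $xhtml := xdmp:document-filter(fn:doc($pdf-uri))

    (: Wrap the text in its own element so it can be included,
       excluded, or weighted separately at query time.  The real
       pipeline also wraps each page's text in a <page> element;
       that detail is elided here. :)
    let $attachment-text :=
      element attachment-text {
        attribute source { $pdf-uri },
        fn:string($xhtml)
      }
    return
      xdmp:node-insert-child(fn:doc($msg-uri)/message, $attachment-text)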
> If so, how do you link up the PDF and the text document — a common
> URL scheme?
The primary mail document has links to the attachment binary
components: the original file, the PDF version, and the large and
small image versions of each page.
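
For illustration only (not MarkMail's actual scheme), a URI
convention along these lines is enough to tie the pieces together:

    xquery version "1.0-ml";

    (: Derive every attachment artifact's URI from the message id so
       each piece can be found from the primary document.  Purely
       illustrative names. :)
    let $msg-id := "msg-1234"
    let $base   := fn:concat("/attachments/", $msg-id, "/report")
    return
      <attachment>
        <original>{ fn:concat($base, ".doc") }</original>
        <pdf>{ fn:concat($base, ".pdf") }</pdf>
        <page-image number="1" size="large">{
          fn:concat($base, "-p1-lg.png")
        }</page-image>
        <page-image number="1" size="small">{
          fn:concat($base, "-p1-sm.png")
        }</page-image>
      </attachment>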
> Do you extract XMP-encoded metadata from the PDFs and use that to
> populate properties or create a new XML document associated with the
> PDF?
We embed the text from the PDF document (extracted via MarkLogic) into
the main message document, with page elements around each page's text
so we can know on which page we have hits. All the attachment text
goes into its own element subsection. We can decide at query time if
we want to include or exclude the attachment text and if we want to
weight it either higher or lower than message body text.
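
In query terms that's roughly the following, with element names
assumed from the description above; the fourth argument to
cts:element-word-query is the weight:

    xquery version "1.0-ml";

    (: Score body hits at full weight and attachment-text hits at
       half weight.  Drop the second branch to exclude attachments
       entirely.  Element names are assumptions. :)
    let $terms := "quarterly results"
    let $query :=
      cts:or-query((
        cts:element-word-query(xs:QName("body"), $terms, (), 1.0),
        cts:element-word-query(xs:QName("attachment-text"), $terms, (), 0.5)
      ))
    return cts:search(fn:doc(), $query)[1 to 10]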
> It would be great to display snippets from the PDF based on the
> pages that match the user query (like Google Book Search does).
>
> Is there a way to extract text from the PDF that retains its page
> and position information so you can go back to the PDF to generate a
> snippet image?
We show text snippets and underline the pages where the matches
occur. We don't extract the x,y position to highlight the word,
although we've thought about it. I don't think the built-in PDF
processing supports that. You'd have to do it with external tools.
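
Given the per-page wrappers described above, finding and snippeting
the matching pages is a short query. A sketch, with assumed element
and attribute names:

    xquery version "1.0-ml";

    (: For one message, report which attachment pages match and wrap
       the matched terms for snippet display.  Names are
       illustrative. :)
    let $query := cts:word-query("quarterly results")
    for $page in fn:doc("/messages/msg-1234.xml")//attachment-text/page
    where cts:contains($page, $query)
    return
      <hit page="{ $page/@number }">{
        cts:highlight($page, $query, <b>{ $cts:text }</b>)
      }</hit>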
> Does maintaining the PDFs in the database have a negative impact on
> index sizes or performance?
With MarkMail we chose to store the PDFs in the database, for various
reasons but mostly simplicity. For example, we can apply the
MarkLogic security model to the binaries as well as the messages,
making them more secure than on an open filesystem. We can also do
transactional deletes, removing the message and its supporting
binaries in one go. Backups are simpler, too, and always internally
consistent between messages and binaries. And there's just one less
piece to fail than if we used NFS for a shared binaries filesystem.
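
The transactional delete can be a single statement, since one XQuery
update commits as one transaction. A sketch with illustrative URIs
and element names:

    xquery version "1.0-ml";

    (: Remove a message and its attachment binaries atomically:
       either everything goes or nothing does. :)
    let $msg-uri := "/messages/msg-1234.xml"
    let $binary-uris :=
      for $link in fn:doc($msg-uri)//attachment/(original | pdf | page-image)
      return fn:string($link)
    return (
      for $uri in $binary-uris
      return xdmp:document-delete($uri),
      xdmp:document-delete($msg-uri)
    )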
But yes, binaries in the database do add to forest size, use memory
in the expanded tree cache when they're being served, and need to be
rewritten during merges. Larger binaries (e.g., movies) have more of
an impact than small ones (e.g., images).
-jh-