On Thu, 28 May 2009 17:18:04 -0700, Alan Darnell <[email protected]> wrote:

I'm wondering if anyone has experience they could share on loading PDFs into ML and indexing them for text retrieval while leaving the PDF in the database for users to download.

Do you use CPF to extract text from the PDF and store that as a new text document in ML? If so, how do you link up the PDF and the text document - a common URI scheme? Do you extract XMP-encoded metadata from the PDFs and use it to populate properties, or to create a new XML document associated with the PDF? It would be great to display snippets from the PDF based on the pages that match the user query (like Google Book Search does). Is there a way to extract text from the PDF that retains its page and position information, so you can go back to the PDF to generate a snippet image? Does keeping the PDFs in the database have a negative impact on index sizes or performance?

Thanks in advance,

Alan

The default CPF PDF conversion will create a new XHTML version of
the PDF. If you just want the extracted text for searching and not for
rendering, one of the alternative pipelines just extracts the text of each
page and stores it as a bag of words in a "page" element. Some metadata
is extracted in each case as well. Properties on the documents
connect the source and the conversion products.
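To make the properties link concrete, here is a minimal XQuery sketch. The URIs, the ".xhtml" naming convention, and the idea of inspecting properties directly are all illustrative assumptions - the exact property elements the pipeline writes vary by release, so check xdmp:document-properties() on one of your own converted documents first.

```xquery
xquery version "1.0-ml";

(: 1. Inspect the properties that tie a source PDF to its
      conversion products. $pdf-uri is a hypothetical URI. :)
let $pdf-uri := "/docs/report.pdf"
let $props   := xdmp:document-properties($pdf-uri)

(: 2. Search the extracted text, then map each hit back to a PDF.
      This assumes (purely for illustration) that the converted
      document lives at the PDF's URI plus an ".xhtml" suffix --
      adjust to whatever your pipeline actually produces. :)
let $hits :=
  for $doc in cts:search(fn:collection(), cts:word-query("retrieval"))
  let $text-uri := xdmp:node-uri($doc)
  where fn:ends-with($text-uri, ".xhtml")
  return fn:replace($text-uri, "\.xhtml$", "")

return ($props, $hits)
```

Since both documents stay in the database, the same approach works in reverse: given a hit in the extracted text, follow the properties (or the naming convention) back to the PDF URI and serve the binary for download.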

//Mary

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
