On Thu, 28 May 2009 17:18:04 -0700, Alan Darnell <[email protected]> wrote:

I'm wondering if anyone has experience they could share on loading PDFs into ML and indexing them for text retrieval while leaving the PDF in the database for users to download.

Do you use CPF to extract text from the PDF and store that as a new text document in ML? If so, how do you link up the PDF and the text document - a common URI scheme? Do you extract XMP-encoded metadata from the PDFs and use it to populate properties, or to create a new XML document associated with the PDF? It would be great to display snippets from the PDF based on the pages that match the user query (like Google Book Search does). Is there a way to extract text from the PDF that retains its page and position information, so you can go back to the PDF to generate a snippet image? Does keeping the PDFs in the database have a negative impact on index sizes or performance?

Thanks in advance,

Alan

The default CPF PDF conversion will create a new XHTML version of
the PDF. If you just want the extracted text for searching and not for
rendering, one of the alternative pipelines just extracts the text of each
page and stores it as a bag of words in a "page" element. Some metadata
is extracted in each case as well. Properties on the documents
connect the source and the conversion products.
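To make the properties link concrete, here is a minimal XQuery sketch. The URIs, the ".xhtml" naming convention, and the idea of inspecting properties directly are all illustrative assumptions - the exact property elements the pipeline writes vary by release, so check xdmp:document-properties() on one of your own converted documents first.

```xquery
xquery version "1.0-ml";

(: 1. Inspect the properties that tie a source PDF to its
      conversion products. $pdf-uri is a hypothetical URI. :)
let $pdf-uri := "/docs/report.pdf"
let $props   := xdmp:document-properties($pdf-uri)

(: 2. Search the extracted text, then map each hit back to a PDF.
      This assumes (purely for illustration) that the converted
      document lives at the PDF's URI plus an ".xhtml" suffix --
      adjust to whatever your pipeline actually produces. :)
let $hits :=
  for $doc in cts:search(fn:collection(), cts:word-query("retrieval"))
  let $text-uri := xdmp:node-uri($doc)
  where fn:ends-with($text-uri, ".xhtml")
  return fn:replace($text-uri, "\.xhtml$", "")

return ($props, $hits)
```

Since both documents stay in the database, the same approach works in reverse: given a hit in the extracted text, follow the properties (or the naming convention) back to the PDF URI and serve the binary for download.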

//Mary

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
