[MarkLogic Dev General] Experience loading PDFs into ML

Alan Darnell Thu, 28 May 2009 17:18:15 -0700

I'm wondering if anyone has experience they could share on loading PDFs into ML 
and indexing them for text retrieval while leaving the PDFs in the database for 
users to download.


Do you use the CPF to extract text from the PDF and store that as a new text 
document in ML?
If so, how do you link up the PDF and the text document - a common URL scheme?
Do you extract XMP-encoded metadata from the PDFs and use that to populate 
properties or to create a new XML document associated with the PDF?
It would be great to display snippets from the PDF based on the pages that 
match the user query (like Google Book Search does).  Is there a way to extract 
text from the PDF that retains its page and position information so you can go 
back to the PDF to generate a snippet image?
Does maintaining the PDFs in the database have a negative impact on index sizes 
or performance?

Thanks in advance,

Alan

_______________________________________________
General mailing list
http://xqzone.com/mailman/listinfo/general
