On Thu, 28 May 2009 17:18:04 -0700, Alan Darnell
<[email protected]> wrote:
I'm wondering if anyone has experience they could share on loading PDFs
into ML and indexing them for text retrieval while leaving the PDF in
the database for users to download.
Do you use the CPF to extract text from the PDF and store that as a new
text document in ML?
If so, how do you link up the PDF and the text document - a common URL
scheme?
Do you extract XMP encoded metadata from the PDFs and use that to
populate properties or create a new XML document associated with the PDF?
It would be great to display snippets from the PDF based on the pages
that match the user query (like Google Book Search does). Is there a
way to extract text from the PDF that retains its page and position
information so you can go back to the PDF to generate a snippet image?
Does maintaining the PDFs in the database have a negative impact on
index sizes or performance?
Thanks in advance,
Alan
The default CPF PDF conversion will create a new XHTML version of
the PDF. If you just want the extracted text for searching and not for
rendering, one of the alternative pipelines just extracts the text of each
page and sticks it as a bag of words in a "page" element. Some metadata
is extracted in each case as well. Properties on the documents
connect the source and the conversion products.
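A minimal XQuery sketch of following that properties link, assuming the
pipeline records the converted document's URI in a property element (the
element name conversion-target below is illustrative; inspect what your
pipeline actually writes with xdmp:document-properties()):

```xquery
xquery version "1.0-ml";

(: Hypothetical example: given a source PDF's URI, read its properties
   document and follow an assumed link element to the conversion product.
   The property name is a placeholder, not the pipeline's real schema. :)
let $pdf-uri := "/docs/report.pdf"
let $props := xdmp:document-properties($pdf-uri)
let $xhtml-uri := $props//*:conversion-target/string()
return
  if ($xhtml-uri) then doc($xhtml-uri)
  else ()
```

Going the other direction (from a search hit in the extracted text back
to the downloadable PDF) works the same way, reading the properties of
the conversion product instead.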
//Mary
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general