On May 28, 2009, at 5:18 PM, Alan Darnell wrote:
> I’m wondering if anyone has experience they could share on loading
> PDFs into ML and indexing these for text retrieval while leaving the
> PDF in the database for users to download.
>
> Do you use the CPF to extract text from the PDF and store that as a
> new text document in ML?
With MarkMail we process attachments that are Office documents and
convert them to PDF internally so we can "take their picture" for in-
browser display. After loading each mail message we call a
postproc.xqy module that does this text extraction and various other
things, like message threading.
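
Roughly, that step looks like the following. This is a minimal sketch
rather than the real postproc.xqy; the URIs and element names are
invented, but xdmp:document-filter is the built-in that converts a
binary like a PDF into XHTML you can pull text from:

    xquery version "1.0-ml";

    (: Sketch of a post-load extraction step.  URIs and element
       names are illustrative. :)
    let $pdf-uri := "/attachments/msg-1234/report.pdf"
    let $msg-uri := "/messages/msg-1234.xml"

    (: Convert the PDF binary to XHTML so we can get at its text. :)
    let $xhtml := xdmp:document-filter(fn:doc($pdf-uri))

    (: Wrap the text in its own element so it can be included,
       excluded, or weighted separately at query time.  The real
       pipeline also wraps each page's text in a <page> element;
       that detail is elided here. :)
    let $attachment-text :=
      element attachment-text {
        attribute source { $pdf-uri },
        fn:string($xhtml)
      }
    return
      xdmp:node-insert-child(fn:doc($msg-uri)/message, $attachment-text)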
> If so, how do you link up the PDF and the text document — a common
> URL scheme?
The primary mail document has links to the attachment binary
components: the original file, the PDF version, and the large and
small image versions of each page.
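
For illustration only (not MarkMail's actual scheme), a URI
convention along these lines is enough to tie the pieces together:

    xquery version "1.0-ml";

    (: Derive every attachment artifact's URI from the message id so
       each piece can be found from the primary document.  Purely
       illustrative names. :)
    let $msg-id := "msg-1234"
    let $base   := fn:concat("/attachments/", $msg-id, "/report")
    return
      <attachment>
        <original>{ fn:concat($base, ".doc") }</original>
        <pdf>{ fn:concat($base, ".pdf") }</pdf>
        <page-image number="1" size="large">{
          fn:concat($base, "-p1-lg.png")
        }</page-image>
        <page-image number="1" size="small">{
          fn:concat($base, "-p1-sm.png")
        }</page-image>
      </attachment>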
> Do you extract XMP-encoded metadata from the PDFs and use that to
> populate properties or create a new XML document associated with the
> PDF?
We embed the text from the PDF document (extracted via MarkLogic) into
the main message document, with page elements around each page's text
so we can know on which page we have hits. All the attachment text
goes into its own element subsection. We can decide at query time if
we want to include or exclude the attachment text and if we want to
weight it either higher or lower than message body text.
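
In query terms that's roughly the following, with element names
assumed from the description above; the fourth argument to
cts:element-word-query is the weight:

    xquery version "1.0-ml";

    (: Score body hits at full weight and attachment-text hits at
       half weight.  Drop the second branch to exclude attachments
       entirely.  Element names are assumptions. :)
    let $terms := "quarterly results"
    let $query :=
      cts:or-query((
        cts:element-word-query(xs:QName("body"), $terms, (), 1.0),
        cts:element-word-query(xs:QName("attachment-text"), $terms, (), 0.5)
      ))
    return cts:search(fn:doc(), $query)[1 to 10]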
> It would be great to display snippets from the PDF based on the
> pages that match the user query (like Google Book Search does).
>
> Is there a way to extract text from the PDF that retains its page
> and position information so you can go back to the PDF to generate a
> snippet image?
We show text snippets and underline the pages where the matches
occur. We don't extract the x,y position to highlight the word,
although we've thought about it. I don't think the built-in PDF
processing supports that. You'd have to do it with external tools.
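
Given the per-page wrappers described above, finding and snippeting
the matching pages is a short query. A sketch, with assumed element
and attribute names:

    xquery version "1.0-ml";

    (: For one message, report which attachment pages match and wrap
       the matched terms for snippet display.  Names are
       illustrative. :)
    let $query := cts:word-query("quarterly results")
    for $page in fn:doc("/messages/msg-1234.xml")//attachment-text/page
    where cts:contains($page, $query)
    return
      <hit page="{ $page/@number }">{
        cts:highlight($page, $query, <b>{ $cts:text }</b>)
      }</hit>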
> Does maintaining the PDFs in the database have a negative impact on
> index sizes or performance?
With MarkMail we chose to store the PDFs in the database, for various
reasons but mostly simplicity. For example, we can apply the
MarkLogic security model to the binaries as well as the messages,
making them more secure than on an open filesystem. We can also do
transactional deletes, removing the message and its supporting
binaries in one go. Backups are simpler, too, and always internally
consistent between messages and binaries. And there's just one less
piece to fail than if we used NFS for a shared binaries filesystem.
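
The transactional delete can be a single statement, since one XQuery
update commits as one transaction. A sketch with illustrative URIs
and element names:

    xquery version "1.0-ml";

    (: Remove a message and its attachment binaries atomically:
       either everything goes or nothing does. :)
    let $msg-uri := "/messages/msg-1234.xml"
    let $binary-uris :=
      for $link in fn:doc($msg-uri)//attachment/(original | pdf | page-image)
      return fn:string($link)
    return (
      for $uri in $binary-uris
      return xdmp:document-delete($uri),
      xdmp:document-delete($msg-uri)
    )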
But yes, binaries in the database do add to forest size, use memory
in the expanded tree cache when they're being served, and need to be
rewritten during merges. Larger binaries (e.g., movies) have more of
an impact than small ones (e.g., images).
-jh-