Thanks Jason. Very helpful.

Alan
On 6/2/09 8:37 PM, "Jason Hunter" <[email protected]> wrote:

On May 28, 2009, at 5:18 PM, Alan Darnell wrote:

> I'm wondering if anyone has experience they could share on loading
> PDFs into ML and indexing these for text retrieval while leaving the
> PDF in the database for users to download.
>
> Do you use the CPF to extract text from the PDF and store that as a
> new text document in ML?

With MarkMail we process attachments that are Office documents and
convert them to PDF internally so we can "take their picture" for
in-browser display. We call a postproc.xqy after loading each mail
message that does this text extraction and various other things, like
message threading.

> If so, how do you link up the PDF and the text document - a common
> URL scheme?

The primary mail document has links to the attachment binary
components: the original file, the PDF version, and the large and
small image versions of each page.

> Do you extract XMP encoded metadata from the PDFs and use that to
> populate properties or create a new XML document associated with the
> PDF?

We embed the text from the PDF document (extracted via MarkLogic) into
the main message document, with page elements around each page's text
so we know on which page we have hits. All the attachment text goes
into its own element subsection. We can decide at query time whether
to include or exclude the attachment text, and whether to weight it
higher or lower than the message body text.

> It would be great to display snippets from the PDF based on the
> pages that match the user query (like Google Book Search does).
> Is there a way to extract text from the PDF that retains its page
> and position information so you can go back to the PDF to generate a
> snippet image?

We show text snippets and underline the pages where the matches occur.
We don't extract the x,y position to highlight the word, although
we've thought about it. I don't think the built-in PDF processing
supports that.
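The page-wrapped embedding described above can be sketched in miniature. This is a Python stand-in for the XQuery postproc, and the element names (`attachment-text`, `page`) are hypothetical, not MarkMail's actual schema:

```python
import xml.etree.ElementTree as ET

def embed_attachment_text(message: ET.Element, pages: list[str]) -> ET.Element:
    """Wrap each page's extracted text in a <page> element inside an
    <attachment-text> subsection of the message document, so a search
    hit can be traced back to the page it occurred on."""
    att = ET.SubElement(message, "attachment-text")
    for n, text in enumerate(pages, start=1):
        page = ET.SubElement(att, "page", {"number": str(n)})
        page.text = text
    return message

msg = ET.Element("message")
embed_attachment_text(msg, ["First page text", "Second page text"])
print(ET.tostring(msg, encoding="unicode"))
```

Because the attachment text lives in its own subsection, a query can scope to it, skip it, or weight it separately from the message body, which is the query-time flexibility described above.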
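One way to realize the "common URL scheme" linking a message to its binary components is to derive every attachment URI from the message URI itself. This is purely illustrative (the thread doesn't show MarkMail's real scheme); a nice side effect is that a transactional delete reduces to removing one computed set of documents:

```python
def attachment_uris(message_uri: str, page_count: int) -> list[str]:
    """Derive the URIs of a message's binary components from the
    message URI (hypothetical scheme): the original attachment, the
    PDF conversion, and a large and small image of each page."""
    base = message_uri.rsplit(".", 1)[0]  # strip the message's extension
    uris = [base + "/original", base + "/converted.pdf"]
    for n in range(1, page_count + 1):
        uris.append(f"{base}/page-{n}-large.png")
        uris.append(f"{base}/page-{n}-small.png")
    return uris

# A message with a 2-page attachment yields six binary URIs.
print(attachment_uris("/lists/general/000123.xml", 2))
```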
You'd have to do it with external tools.

> Does maintaining the PDFs in the database have a negative impact on
> index sizes or performance?

With MarkMail we chose to store the PDFs in the database, for various
reasons but mostly simplicity. For example, we can apply the MarkLogic
security model to the binaries as well as the messages, making them
more secure than on an open filesystem. We can also do transactional
deletes, removing a message and its supporting binaries in one go.
Backups are simpler, too, and always internally consistent between
messages and binaries. And there's one less piece to fail than if we
used NFS for a shared binaries filesystem.

But yes, binaries in the database do add to forest size, use memory in
the expanded tree cache while they're being served, and will need to
be rewritten during merges. Larger binaries (e.g., movies) have more
of an impact than small ones (e.g., images).

-jh-

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
