Thanks, Tony.  This gives me some ideas for handling our content.

Alan


On 5/29/09 11:49 AM, "Apuzzo, Tony" <[email protected]> wrote:

We haven't done the exact process described by the OP.  What we're doing to 
load ~4 million pages of PDF is:
* Use the iText library to split incoming PDFs into separate pages (see the 
sketch after this list)
* Store the split PDF pages on an external web server (We use WebLogic with a 
REST front-end, but a plain static HTTP server would work too.)
* Deliver the PDF files into MarkLogic CPF to do text conversion for 
full-text search.
* Create a top-level asset "DocBook-like" XML file that contains the converted 
text plus URL references to the page-split PDFs.
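
In case it helps, here is a rough sketch of the page-splitting step, assuming 
iText 2.x (the com.lowagie.text packages); the class, method, and file names 
are illustrative only:

    import java.io.FileOutputStream;

    import com.lowagie.text.Document;
    import com.lowagie.text.pdf.PdfCopy;
    import com.lowagie.text.pdf.PdfReader;

    public class PageSplitter {
        public static void split(String inputPdf, String outputDir) throws Exception {
            PdfReader reader = new PdfReader(inputPdf);
            int pageCount = reader.getNumberOfPages();
            for (int i = 1; i <= pageCount; i++) {
                // One output document per source page
                Document doc = new Document();
                PdfCopy copy = new PdfCopy(doc,
                        new FileOutputStream(outputDir + "/page-" + i + ".pdf"));
                doc.open();
                copy.addPage(copy.getImportedPage(reader, i));
                doc.close();  // closing the Document flushes the single-page PDF
            }
            reader.close();
        }
    }

The per-page files produced this way are what end up on the web server.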

Our input PDFs do not have any embedded metadata, so we aren't trying to 
extract anything from them, but we could use iText to extract the PDF 
properties if we needed to.
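
If we did need them, a minimal sketch of that extraction, again assuming 
iText 2.x (PdfReader exposes the document information dictionary and any XMP 
packet); the class name here is made up:

    import java.util.Map;

    import com.lowagie.text.pdf.PdfReader;

    public class PdfMetadataDump {
        public static void dump(String inputPdf) throws Exception {
            PdfReader reader = new PdfReader(inputPdf);
            // Document information dictionary (Title, Author, Subject, ...)
            Map info = reader.getInfo();
            System.out.println(info);
            // Raw XMP packet, if the PDF carries one (null otherwise)
            byte[] xmp = reader.getMetadata();
            if (xmp != null) {
                System.out.println(new String(xmp, "UTF-8"));
            }
            reader.close();
        }
    }

The XMP bytes could then feed document properties or a sidecar XML document in ML.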

Performance is very good with this scheme, and we don't have to worry about 
storing BLOBs in ML. We don't have a requirement for snippet highlighting, 
but we do get to the correct page(s) very easily.

HTH,

Tony Apuzzo
Distinguished Engineer
Flatirons Solutions
http://www.flatironssolutions.com
-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Mary Holstege
Sent: Friday, May 29, 2009 9:16 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Experience loading PDFs into ML

On Thu, 28 May 2009 17:18:04 -0700, Alan Darnell
<[email protected]> wrote:

> I'm wondering if anyone has experience they could share on loading PDFs
> into ML and indexing these for text retrieval while leaving the PDF in
> the database for users to download.
>
> Do you use the CPF to extract text from the PDF and store that as a new
> text document in ML?
> If so, how do you link up the PDF and the text document - a common URL
> scheme?
> Do you extract XMP encoded metadata from the PDFs and use that to
> populate properties or create a new XML document associated with the PDF?
> It would be great to display snippets from the PDF based on the pages
> that match the user query (like Google Book Search does).  Is there a
> way to extract text from the PDF that retains its page and position
> information so you can go back to the PDF to generate a snippet image?
> Does maintaining the PDFs in the database have a negative impact on
> index sizes or performance?
>
> Thanks in advance,
>
> Alan

The default CPF PDF conversion will create a new XHTML version of
the PDF. If you just want the extracted text for searching and not for
rendering, one of the alternative pipelines just extracts the text of each
page and sticks it as a bag of words in a "page" element.  Some metadata
is extracted in each case as well.  Properties on the documents
connect the source document to its conversion products.

//Mary

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
