Re: [MarkLogic Dev General] Title extraction

Mary Holstege Tue, 07 Apr 2015 13:23:18 -0700

On Tue, 07 Apr 2015 13:03:27 -0700, Robert De Vivo <[email protected]>  
wrote:


> I have a requirement to extract study titles from clinical documents in  
> PDF and MS Word formats.  There is no reliable pattern to the text or  
> the formatting of the titles, so my options for direct querying are  
> limited.
>
> Are there any entity enrichment tools which might help to identify study  
> titles in the clinical-document domain?  Temis Luxid looks promising,  
> but I have not been able to locate the Samples directory on my MarkLogic  
> AWS image, so I don't know how to get started with that option.
>
> Bob

Have you tried the conversion application that ships with MarkLogic? In  
addition to the raw format conversion (which extracts the raw metadata),  
it does some style-based inferencing, which gives you a decent shot at  
getting the title if it wasn't part of metadata extraction.  The  
post-processing is available for PDF and binary-format MS Word (.doc).  
For .docx files you can give xdmp:document-filter a go: it might extract  
the title as metadata, but it does not do the style-based inferencing.
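
As a rough sketch of what to do with the xdmp:document-filter result once
you have it: the filter returns XHTML with the extracted metadata in the
head, so you can look there for title candidates. The snippet below is
Python run outside the server and assumes you have already fetched the
filter output as a string. The function name, the "Title" meta name, and
the sample document are all hypothetical; check what your own files
actually produce before relying on any of them.

```python
import xml.etree.ElementTree as ET
from typing import List

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def candidate_titles(filtered_xhtml: str) -> List[str]:
    """Collect title candidates from filter-style XHTML output.

    Assumption: the title, if extracted at all, shows up either in
    <title> or in a <meta> whose name is "title" (any letter case).
    """
    root = ET.fromstring(filtered_xhtml)
    found: List[str] = []
    # A non-empty <title> element is the first candidate.
    for el in root.iter(XHTML_NS + "title"):
        text = (el.text or "").strip()
        if text and text not in found:
            found.append(text)
    # Then any <meta name="title" content="..."/> entries.
    for el in root.iter(XHTML_NS + "meta"):
        if el.get("name", "").lower() == "title":
            content = el.get("content", "").strip()
            if content and content not in found:
                found.append(content)
    return found

# Made-up sample shaped like filter output, for illustration only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta name="content-type" content="application/msword"/>
<meta name="Title" content="Example Study Title"/>
</head>
<body><p>Body text.</p></body>
</html>"""

if __name__ == "__main__":
    print(candidate_titles(SAMPLE))
```

An empty result means the title was not captured as metadata, which is
the case where the style-based inferencing in the conversion application
would have been the fallback.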

//Mary
_______________________________________________
General mailing list
[email protected]
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
