On Tue, 07 Apr 2015 13:03:27 -0700, Robert De Vivo <[email protected]> wrote:
> I have a requirement to extract study titles from clinical documents in
> PDF and MS Word formats. There is no reliable pattern to the text or
> the formatting of the titles, so my options for direct querying are
> limited.
>
> Are there any entity enrichment tools which might help to identify study
> titles in the clinical-document domain? Temis Luxid looks promising,
> but I have not been able to locate the Samples directory on my MarkLogic
> AWS image, so I don't know how to get started with that option.
>
> Bob

Have you tried the conversion application that ships with MarkLogic? In
addition to the raw format conversion (which extracts the raw metadata),
it does some style-based inferencing, which gives you a decent shot at
getting the title if it wasn't part of metadata extraction. The
post-processing is available for PDF and binary-format MS Word (.doc).

For docx MS Word you can give xdmp:document-filter a go and it might
extract the title as metadata, but there isn't the style-based
inferencing.

//Mary
