It's worth comparing MarkLogic's PDF-to-XML (and XHTML) conversion against
the export facility in Adobe Acrobat 9, if you have it. I've recently been
evaluating the two. Neither is perfect, and their strengths and weaknesses
fall in different places. Letter-perfect XML/XHTML conversion from a
complex PDF is very hard to get, because the underlying PDF data is full of
font changes, typographic features, and other things that cause
"interference" in the output.
For example, in converting the PDF from a typeset book containing wide
angle brackets (U+2329 / U+232A or similar), the Acrobat export
consistently captured them with styled <span>s, while the MarkLogic
export sometimes captured them and sometimes dropped them or substituted
'( )'. On the other hand, MarkLogic correctly normalized the "fi" ligature
to plain "fi", while Acrobat inserted an extra space ("fi ") for no good
reason.
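
If you want to check either converter's output for these two problems, a
rough sketch along the following lines may help. It assumes the converted
text has already been read into a Python string. The function names, the
list of ligatures, and the extra bracket pairs standing in for "or similar"
are my own illustrative choices, not part of MarkLogic or Acrobat. Ligatures
are expanded one by one from an explicit table rather than by normalizing
the whole string, because Unicode normalization would also rewrite
U+2329 / U+232A as U+3008 / U+3009.

```python
from collections import Counter

# Common Latin ligatures and their plain-letter equivalents
# (an illustrative list, not an exhaustive one).
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
}

# U+2329 / U+232A plus two similar bracket pairs; which characters count
# as "similar" is an assumption.
ANGLE_BRACKETS = "\u2329\u232A\u27E8\u27E9\u3008\u3009"


def audit(text):
    """Count the ligature and angle-bracket characters present in text."""
    counts = Counter(
        ch for ch in text if ch in LIGATURES or ch in ANGLE_BRACKETS
    )
    return {f"U+{ord(ch):04X}": n for ch, n in sorted(counts.items())}


def expand_ligatures(text):
    """Replace each listed ligature with plain letters; leave the rest alone."""
    return text.translate(str.maketrans(LIGATURES))
```

Running audit() on the text extracted by each tool and comparing the counts
shows whether brackets were dropped along the way. A result with no bracket
entries for a passage that should contain them suggests they were lost or
replaced with '( )'.
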
MarkLogic's PDF conversion pipelines give you more control over how the
output is structured than Acrobat does.
DS
On Tue, 28 Jul 2009, Baranov, Ivan - Moscow wrote:
> Hi All
>
> I've recently tried to convert PDF to XML using built-in function
> xdmp:pdf-convert() and discovered that my company's license does not
> allow this. Actually I have my own converter so I just wanted to try
> if ML does it better or faster and now I'm curious about, is there any
> way to acquire such functionality on a trial basis?
> Thanks,
> Van
>
--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 801079, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: [email protected] Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/