RE: [MarkLogic Dev General] 2 column Pdf to Xml conversion

Geert Josten Wed, 10 Feb 2010 08:38:58 -0800

Hi Aniruddha,

You might be interested in the Default conversion option that comes with the 
Content Processing Framework (CPF). It supports PDF to XML conversion in several 
flavours, a few of which try to preserve page layout information. The 
PDF conversion in the Default conversion option also adds postprocessing on top 
of the plain xdmp:pdf-convert call. I am not sure it provides all you need, but 
it is an interesting feature anyhow.


You can read more on CPF here: 
http://developer.marklogic.com/pubs/4.1/books/cpf.pdf, see chapter 9 for 
details on the Default conversion option.

Kind regards,
Geert

drs. G.P.H. (Geert) Josten
Consultant


Daidalos BV
Hoekeindsehof 1-4
2665 JZ Bleiswijk

T +31 (0)10 850 1200
F +31 (0)10 850 1199

mailto:[email protected]
http://www.daidalos.nl/

KvK 27164984


> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> aniruddha biswas
> Sent: Wednesday 10 February 2010 16:37
> To: [email protected]
> Subject: [MarkLogic Dev General] 2 column Pdf to Xml conversion
>
> Hi All,
>
> I am a new developer to Mark Logic. I need your help
> regarding the following:
>
> I have a 2-column PDF. I have already ingested this PDF into
> Mark Logic. I need to make a DocBook XML document from this PDF. I am
> using the following query for this conversion:
>
> xquery version '0.9-ml'
> import module namespace dbk = 'http://marklogic.com/cpf/docbook'
>   at '/MarkLogic/conversion/docbook.xqy'
> let $results :=
>   xdmp:pdf-convert(doc('10747_2007_article_bf02760200.pdf'),
>     '10747_2007_article_bf02760200.pdf')
> let $xhtml := $results[2]
> let $options :=
>   <options xmlns='dbk:convert'>
>     <wrap-text>true</wrap-text>
>     <preserve-styles>true</preserve-styles>
>   </options>
> return dbk:convert($xhtml, $options)[2]
>
>
> I am getting the XML, but it does not retain the column
> position of the data. Do you have any idea regarding this? Please find
> attached the PDFtoXHTML.cfg file that is used in this query.
>
> The next problem I am facing is that the PDF contains many
> special characters (for scientific notation: gamma, kappa, alpha)
> as well as table data. How do I convert the PDF including all
> these characters and data?
>
> Please help.
>
> Thanks in advance.
>
> Aniruddha