Hi Aniruddha,

You might be interested in the Default conversion option that comes with the Content Processing Framework (CPF). It supports PDF to XML conversion in several flavours, a few of which try to preserve page layout information. The PDF conversion in the Default conversion option also adds post-processing on top of the straightforward xdmp:pdf-convert. I am not sure it provides all you need, but it is an interesting feature anyhow.
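For comparison, here is a minimal sketch of looking at the raw xdmp:pdf-convert output before any CPF post-processing. It reuses the document URI from the question below and assumes the PDF is already loaded under that URI. The first node returned is the manifest of generated parts, and the count shows how many nodes came back. Treat it as a starting point for inspection, not as the Default conversion pipeline itself.

```xquery
xquery version "1.0-ml";

(: Sketch only: the URI is taken from the original question and is
   assumed to point at a PDF that has already been ingested. :)
let $uri := "10747_2007_article_bf02760200.pdf"
let $results := xdmp:pdf-convert(fn:doc($uri), $uri)
return (
  (: first node: manifest listing the parts produced by the conversion :)
  $results[1],
  (: total number of nodes returned (manifest, XHTML, CSS, images, ...) :)
  fn:count($results)
)
```

Checking the manifest first makes it easier to see which of the returned nodes is the XHTML that gets passed on to dbk:convert.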
You can read more on CPF here: http://developer.marklogic.com/pubs/4.1/books/cpf.pdf. See chapter 9 for details on the Default conversion option.

Kind regards,
Geert

drs. G.P.H. (Geert) Josten
Consultant

Daidalos BV
Hoekeindsehof 1-4
2665 JZ Bleiswijk
T +31 (0)10 850 1200
F +31 (0)10 850 1199
mailto:[email protected]
http://www.daidalos.nl/
KvK 27164984

Please consider the environment before printing this mail.

The information sent in or with this e-mail message originates from Daidalos BV and is intended solely for the addressee. If you have received this message unintentionally, we ask you to delete it. No rights can be derived from this message.

> From: [email protected]
> [mailto:[email protected]] On Behalf Of aniruddha biswas
> Sent: Wednesday, 10 February 2010 16:37
> To: [email protected]
> Subject: [MarkLogic Dev General] 2 column Pdf to Xml conversion
>
> Hi All,
>
> I am a new developer to MarkLogic. I need your help regarding the
> following:
>
> I have a 2-column PDF. I have already ingested this PDF into
> MarkLogic. I need to make a DocBook XML from this PDF. I am using
> the following query for this conversion:
>
> xquery version '0.9-ml'
> import module namespace dbk = 'http://marklogic.com/cpf/docbook'
>   at '/MarkLogic/conversion/docbook.xqy'
> let $results := xdmp:pdf-convert(
>   doc('10747_2007_article_bf02760200.pdf'),
>   '10747_2007_article_bf02760200.pdf')
> let $xhtml := $results[2]
> let $options :=
>   <options xmlns='dbk:convert'>
>     <wrap-text>true</wrap-text>
>     <preserve-styles>true</preserve-styles>
>   </options>
> return dbk:convert($xhtml, $options)[2]
>
> I am getting the XML, but it does not retain the column position of
> the data. Do you have any idea regarding this? Please find attached
> the PDFtoXHTML.cfg file that is being used in this query.
>
> The next problem I am facing is that the PDF contains many special
> characters (for scientific notation: gamma, kappa, alpha) as well as
> table data. How do I convert the PDF including all these characters
> and data?
>
> Please help.
>
> Thanks in advance.
>
> Aniruddha

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
