Re: [MarkLogic Dev General] 2 column Pdf to Xml conversion

Mary Holstege Wed, 10 Feb 2010 09:52:09 -0800

On Wed, 10 Feb 2010 07:36:32 -0800, aniruddha biswas<[email protected]> wrote:

I am getting the XML, but it does not retain the column position of the data. Do you have any idea regarding this? PFA the PDFtoXHTML.cfg file that is being used in this query.


The Docbook conversion will work a lot better if you apply it to
the output of the CPF conversion application that ships with the
server, FWIW.  That application does a lot of clean up of the
raw output of PDF conversion as well as some structural
inferencing, and the Docbook conversion does a much better
job if that clean up and inferencing has been done.

PDFtoXHTML.cfg is designed to give a reasonable structural
view of the original PDF with reflowable paragraphs, but the
cost is that you lose the original rendering.  You might try
PDFtoXHTML_exact.cfg.  In CPF this would be to use one
of the alternative PDF conversion pipelines.  The pipeline
"PDF Conversion (Page Layout with Reblocking)" will try to
preserve the rendering, but also try to recover some of the
section/subsection information.  If you really don't care
about the section information, you can try the pipeline
"PDF Conversion (Page Layout)" instead.   If you call
xdmp:pdf-convert directly, be sure to add the option:
 <line-breaks xmlns="xdmp:pdf-convert">true</line-breaks>
The downside with exact layout is that you may well
lose the cohesion of the text flow.

The DocBook preserve-styles options will preserve the
class attributes on the elements, but you need to make
sure that all styles are captured as CSS classes.  You may
need to call xhtml:clean() to get some of the style fixups
to happen.  The CSS itself is not extracted in the Docbook
conversion: in general the positional styles will be in the
XHTML document itself, while the font styles will be in
conv.css.  Right now the Docbook conversion does not
gather these up or attach an xml-stylesheet processing
instruction (although perhaps it should).

The next problem I am facing is that the PDF contains many special characters (for scientific notation: gamma, kappa, alpha) as well as table data. How do I convert the PDF including all these characters and data?


Do you mean that the characters are coming out garbled?  Or that you
want named entities for them?

If it is named entities, look at the server documentation on the
output sgml character entities parameter on the appserver.

I'm assuming the former.  This is generally a font mapping issue.
It is somewhat painful to solve the first time, but once you have solved
it once for that font, it can stay solved.  What is going on is that the
PDF lacks complete font information. Specifically what is missing is
the mapping from font glyph numbers to Unicode codepoints.
The converter has information about many common fonts, and
usually the necessary information is included in the PDF.  Sometimes
the information is incomplete or (depending on the tool that was used
to make the PDF) incorrect, alas.

What you need to do is add a FONT MAP to your PDF config file.
This is in Converters/cvtpdf/ under the install directory.   You can
use an include to do this, if you like, as is done for ZapfDingbats
and Symbol already, or just inline it as is done for WingDings.
Example:
[FONT MAP:*Wingdings*]
Ignore =false;
Preserve Line Breaks =false;
Symbol =false;
Monospace =false;
Serif =false;
Glyph Names =false;
Glyph Start =[;
Glyph End =];
d057 =\x2713;
xfffe = ;
[-- END --]

This maps the codepoint 57 (hex 39, DIGIT NINE) to
codepoint 10003 (hex 2713, CHECK MARK) and codepoint
65534 (hex FFFE, unassigned) to nothing at all.
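If you want to double-check the decimal/hex arithmetic and character names
for a mapping like this, a few lines of Python will do it. This is only a
sanity-check sketch and not part of the converter; the helper name describe
is made up for illustration.

```python
import unicodedata

# Values taken from the Wingdings example: "d057" is decimal 57,
# "\x2713" is hex 2713, and "xfffe" is hex FFFE.
SOURCE_CODE = 57
TARGET_CODE = 0x2713
DROPPED_CODE = 0xFFFE


def describe(cp):
    """Return 'U+XXXX (decimal) NAME'; codepoints without a name get a placeholder."""
    name = unicodedata.name(chr(cp), "<no name>")
    return f"U+{cp:04X} ({cp}) {name}"


for cp in (SOURCE_CODE, TARGET_CODE, DROPPED_CODE):
    print(describe(cp))
# prints:
# U+0039 (57) DIGIT NINE
# U+2713 (10003) CHECK MARK
# U+FFFE (65534) <no name>
```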

If only one or two characters are misassigned you can do
this by looking at the codepoint that is produced in output,
figuring out what codepoint you'd like it to be (a nice resource
here is http://people.w3.org/rishida/scripts/uniview/) and
adding the appropriate mapping one at a time.  If a lot of
characters are misassigned, you can use Iceni's Gemini tool
(this part works perfectly well even in demo mode) to map the
characters.  You'll then have to find the FONT MAP it created
in the Gemini.cfg file in its installation directory and copy that
in to your PDF conversion cfg file.
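For the one-at-a-time approach, something like the following sketch can
list the non-ASCII codepoints that show up in the converted text and format
a candidate mapping line. Assumptions: the function names are hypothetical,
the private-use codepoints in the sample stand in for whatever your font
actually produces, and the entry layout just mirrors the "xfffe" key and
"\x2713" value in the example above, so check it against the shipped cfg
files before relying on it.

```python
import unicodedata
from collections import Counter


def non_ascii_report(text):
    """Return (codepoint, count, name) rows for each non-ASCII character in text."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ord(ch), n, unicodedata.name(ch, "<no name>"))
        for ch, n in sorted(counts.items())
    ]


def font_map_line(produced, wanted):
    """Format one mapping entry in the style of the example (layout is an assumption)."""
    return f"x{produced:04x} =\\x{wanted:04x};"


# Hypothetical converter output: U+F061 and U+F062 stand in for unmapped glyphs.
sample = "alpha \uf061 beta \uf062 alpha \uf061"

for cp, n, name in non_ascii_report(sample):
    print(f"U+{cp:04X} seen {n}x {name}")

# Suppose U+F061 should really be GREEK SMALL LETTER ALPHA (U+03B1):
print(font_map_line(0xF061, 0x03B1))
```

The report makes it easy to spot which codepoints are suspicious (unnamed
or private-use ones are the usual giveaway); the mapping lines still have
to be reviewed by hand and placed inside the right FONT MAP section.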

Once this mapping is in place, it will automatically be applied
in future, so you won't need to worry about it for that font.

//Mary


Please help.

Thanks in advance.

Aniruddha


[email protected]
Principal Engineer
Mark Logic Corporation
