Re: deducing table cells in a PDF document

Andreas Lehmkühler Wed, 30 Sep 2009 22:52:42 -0700

Hi,

Pranay Pramod wrote:
> Thanks, Andreas, for your interest.
> I am trying to extract text including the table information from PDF
> documents.
> The current capability of PDFBox extracts only plain text.
> 
> Using the graphics operators moveTo (m), lineTo (l) and rectangle (re), I am
> able to deduce the lines forming the table on a PDF page. From those, my
> algorithm can work out the individual cells of the table. My code assumes
> that the standard coordinate system is in use. Whenever I encounter a
> different coordinate system or a different way of drawing the table lines
> (shifting the origin for every line drawn?), my code breaks, for the obvious
> reason.
> 
> The PDF Reference 1.7 hints at pre-processing the CTM or the graphics state
> to feed standard coordinates to my module.
Yes, that's the point. You have to look at the CTM AND the
graphics state (see chapter 4.3 of the PDF 1.7 reference).


The CTM (current transformation matrix) is used to scale, rotate and
shift the coordinates. It is a bit too complex to describe the whole
thing in two sentences. Have a look at the usage of
PDGraphicsState.getCurrentTransformationMatrix(), especially in
PageDrawer.transformedPoint().
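
To give a rough idea, here is a minimal sketch of how the operands of
the cm operator accumulate in the CTM and how a point from a path
operator is then mapped through it. It uses java.awt.geom.AffineTransform
instead of PDFBox's own matrix class, and the class and method names
(CtmSketch, concatenate, transformedPoint) are made up for illustration.
It also assumes the CTM starts out as the identity, which ignores any
page-level rotation or offset.

```java
import java.awt.geom.AffineTransform;

/** Hypothetical helper for illustration; not part of PDFBox. */
class CtmSketch {
    // Assumption: start from the identity, i.e. default user space.
    private final AffineTransform ctm = new AffineTransform();

    /**
     * Handles the operands of "a b c d e f cm".
     * AffineTransform's six-argument constructor takes the values in the
     * same order as the PDF operands. concatenate() applies the new matrix
     * to a point first and the previous CTM afterwards, which matches how
     * cm modifies the CTM.
     */
    void concatenate(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    /** Maps the operands of m, l or re through the current CTM. */
    double[] transformedPoint(double x, double y) {
        double[] in = {x, y};
        double[] out = new double[2];
        ctm.transform(in, 0, out, 0, 1);
        return out;
    }

    public static void main(String[] args) {
        CtmSketch sketch = new CtmSketch();
        sketch.concatenate(1, 0, 0, 1, 10, 20); // 1 0 0 1 10 20 cm (shift origin)
        sketch.concatenate(2, 0, 0, 2, 0, 0);   // 2 0 0 2 0 0 cm   (scale by 2)
        double[] p = sketch.transformedPoint(5, 5); // 5 5 m
        System.out.println(p[0] + ", " + p[1]); // prints "20.0, 30.0"
    }
}
```

If every line end point goes through something like transformedPoint()
before the table detection sees it, a content stream that shifts the
origin for each line should give the same coordinates as one that does
not.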

As for the graphics state, the stack is the important part. The current
state can be saved to that stack and restored from it later. You have to
implement that behaviour as well, otherwise the graphics states will get
mixed up. In PDFBox, the PDFStreamEngine holds this stack.
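
The save/restore behaviour could be sketched roughly as follows. In the
content stream the q operator saves the state and Q restores it. The
sketch tracks only the CTM, although a real graphics state carries more
(line width, colour, clipping path and so on). The names
GraphicsStateStackSketch, save, restore and concatenate are assumptions
for illustration and are not the PDFStreamEngine API.

```java
import java.awt.geom.AffineTransform;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper for illustration; tracks only the CTM. */
class GraphicsStateStackSketch {
    private AffineTransform ctm = new AffineTransform();
    private final Deque<AffineTransform> stack = new ArrayDeque<>();

    /** Operator q: push a copy, so later cm operators cannot alter the saved state. */
    void save() {
        stack.push(new AffineTransform(ctm));
    }

    /** Operator Q: go back to the most recently saved state. */
    void restore() {
        if (!stack.isEmpty()) { // tolerate an unbalanced Q in a broken file
            ctm = stack.pop();
        }
    }

    /** Operator cm: the new matrix is applied before the previous CTM. */
    void concatenate(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    /** Maps a point through the current CTM. */
    double[] transformedPoint(double x, double y) {
        double[] in = {x, y};
        double[] out = new double[2];
        ctm.transform(in, 0, out, 0, 1);
        return out;
    }
}
```

With this, a sequence such as "q 1 0 0 1 0 50 cm ... Q" shifts the
origin only for the operators between q and Q. Without the stack the
shift would leak into everything drawn afterwards.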

HTH
Andreas Lehmkühler
