Hi,

Pranay Pramod wrote:
> Thanks Andreas for showing up your interest.
> I am trying to extract text including the table information from PDF
> documents.
> The current capability of PDFBox extracts only plain text.
>
> using the graphics operator moveTo (m), lineTo(l), Rectangle(re), I am able
> to deduce the lines forming the table in a PDF document page. Finally my
> algorithm can make out the individual cells of the table. My code assumes
> the standard coordinate system being used. Whenever I encounter a different
> coordinate system or a different way of rendering the lines of the table(
> shifting the origin for every line draw???), my code breaks for the obvious
> reason.
>
> The pdf-reference1.7 hints at pre-processing the CTM or the graphic state to
> fetch standard coordinate to my module.

Yes, that's the point. You have to look at both the CTM and the graphics state (see chapter 4.3 of the PDF 1.7 reference).

The CTM is used to scale, rotate and shift the coordinates. It is a bit too complex to describe the whole thing in two sentences. Have a look at how PDGraphicsState.getCurrentTransformationMatrix() is used, especially in PageDrawer.transformedPoint().

As for the graphics state, the stack is important. The state can be saved to that stack and restored from it later (the q and Q operators). You have to implement that behaviour as well, otherwise the graphics states will get mixed up. In PDFBox the PDFStreamEngine holds this stack.

HTH
Andreas Lehmkühler
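
To illustrate the two points above (applying the CTM to path coordinates and keeping a save/restore stack), here is a minimal sketch. It is not PDFBox code: the class CtmTracker and its methods save, restore, concatenate and toPageSpace are hypothetical names made up for this example, and it uses java.awt.geom.AffineTransform from the JDK in place of PDFBox's own matrix class. It assumes the caller has already parsed the operators (q, Q, cm, m, l, re) and their operands from the content stream.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not part of PDFBox): tracks the CTM across q, Q and cm. */
class CtmTracker {
    // Saved matrices, most recent on top. A real graphics state holds more
    // than the CTM (line width, colour, clipping path, ...); only the CTM
    // matters for recovering table line coordinates.
    private final Deque<AffineTransform> saved = new ArrayDeque<>();

    // Current transformation matrix, starting as the identity.
    private AffineTransform ctm = new AffineTransform();

    /** 'q' operator: push a copy of the current matrix onto the stack. */
    void save() {
        saved.push(new AffineTransform(ctm));
    }

    /** 'Q' operator: pop the most recently saved matrix; ignored if the stack is empty. */
    void restore() {
        if (!saved.isEmpty()) {
            ctm = saved.pop();
        }
    }

    /**
     * 'cm' operator with operands a b c d e f. The new matrix is applied to a
     * point first and the previous CTM afterwards, which is what
     * AffineTransform.concatenate does.
     */
    void concatenate(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    /** Maps an operand pair of m, l or re through the current CTM. */
    Point2D toPageSpace(double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }
}
```

With something like this in place, the table detection code would call toPageSpace for every coordinate it reads from m, l and re before comparing lines, so that all lines end up in the same coordinate system no matter how often the content stream shifts the origin. For re, all four corners should be transformed, because a rotated rectangle is no longer axis-aligned.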