Darren,

FDnC Red wrote
> I'm attaching the sample PDF that I'm parsing, the Program.cs that I'm
> using, and an XML file which is the output of MuPDF's mudraw.exe which I'm
> using as "ground truth" data because the stroke_path matrix exactly
> matches where the lines are in the PDF.

I work predominantly on the Java side, so I had to translate your program to Java. Then I looked at its output, and that output matches the MuPDF output exactly (when looked at from the correct angle):

FindPdfLines output:

Start X,Y= 19.96,538.9747 Length=716.89 Height=0.0
1.0 0.0 0.0
0.0 1.0 0.0
19.96 538.9747 1.0

Start X,Y= 19.96,399.63 Length=716.89 Height=0.0
1.0 0.0 0.0
0.0 1.0 0.0
19.96 399.63 1.0

Start X,Y= 19.96,268.3525 Length=716.89 Height=0.0
1.0 0.0 0.0
0.0 1.0 0.0
19.96 268.3525 1.0

Start X,Y= 19.96,141.3561 Length=716.89 Height=0.0
1.0 0.0 0.0
0.0 1.0 0.0
19.96 141.3561 1.0

Start X,Y= 184.01,538.4 Length=0.0 Height=509.96
1.0 0.0 0.0
0.0 1.0 0.0
184.01 538.4 1.0

Start X,Y= 368.6952,659.88 Length=0.0 Height=631.44
1.0 0.0 0.0
0.0 1.0 0.0
368.6952 659.88 1.0

Start X,Y= 561.25,538.4 Length=0.0 Height=509.96
1.0 0.0 0.0
0.0 1.0 0.0
561.25 538.4 1.0

MuPDF output:

<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0" colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 199.025">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0" colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 338.37">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0" colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 469.647">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0" colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 596.644">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0" colorspace="DeviceCMYK" color="0 0 0 1" matrix="1 0 0 -1 184.01 199.6">
<moveto x="0" y="0"/>
<lineto x="0" y="-509.96"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0" colorspace="DeviceCMYK" color="0 0 0 1" matrix="1 0 0 -1 368.695 78.12">
<moveto x="0" y="0"/>
<lineto x="0" y="-631.44"/>
</stroke_path>
<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0" colorspace="DeviceCMYK" color="0 0 0 1" matrix="1 0 0 -1 561.25 199.6">
<moveto x="0" y="0"/>
<lineto x="0" y="-509.96"/>
</stroke_path>

Let's look at the first line to resolve the ostensible differences:

Start X,Y= 19.96,538.9747 Length=716.89 Height=0.0
1.0 0.0 0.0
0.0 1.0 0.0
19.96 538.9747 1.0

<stroke_path linewidth="0.5" miterlimit="4" linecap="0,0,0" linejoin="0" colorspace="DeviceCMYK" color="0 0 0 0.5" matrix="1 0 0 -1 19.96 199.025">
<moveto x="0" y="0"/>
<lineto x="716.89" y="0"/>
</stroke_path>

The obvious difference is that FindPdfLines has already applied the transformation matrix while MuPDF has not yet applied it. After applying it to the MuPDF data, the starting point is at (x,y) = (19.96, 199.025).

So the x coordinates already match, but the y coordinates seem not to match at all. They only /seem/ not to match, though. As soon as one realizes that the two outputs are given in different coordinate systems, they do match!

iTextSharp gives you the coordinates in the native PDF default user space, i.e. it uses the PDF page media box (in your file [0.0 0.0 756.0 738.0]) with /0,0 being the lower left corner and 756,738 being the upper right/.

MuPDF, on the other hand, uses a different coordinate system that is more common in other image formats, with /0,0 being the upper left corner and 756,738 being the lower right/.

To transform the coordinates of an individual point between these coordinate systems, you keep the x coordinate and subtract the y coordinate from 738 (the page height).

After doing that transformation (and allowing for minor differences due to lossy float arithmetic), the coordinates match:

FindPdfLines: 19.96,538.9747
MuPDF: 19.96,538.975 (=738-199.025)

The same is true for the other lines.
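
The conversion described above can be checked with a few lines of Java. This is only a minimal sketch: the class and method names (CoordinateCheck, apply, flipY) are made up for illustration and belong to neither iText nor MuPDF, and I assume the six values of the MuPDF matrix attribute are in the usual PDF order a b c d e f.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of iText or MuPDF. */
final class CoordinateCheck {

    /** Applies a matrix given as [a b c d e f] (assumed PDF order) to the point (x, y). */
    static double[] apply(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    /** Switches between a top-left origin and a bottom-left origin: x stays, y is mirrored. */
    static double[] flipY(double[] p, double pageHeight) {
        return new double[] { p[0], pageHeight - p[1] };
    }

    public static void main(String[] args) {
        double pageHeight = 738.0;                          // from the media box [0 0 756 738]
        double[] matrix = { 1, 0, 0, -1, 19.96, 199.025 };  // first MuPDF stroke_path

        // moveto (0,0) and lineto (716.89,0), first transformed, then mirrored
        double[] start = flipY(apply(matrix, 0, 0), pageHeight);
        double[] end = flipY(apply(matrix, 716.89, 0), pageHeight);

        System.out.printf(Locale.ROOT, "start=(%.3f, %.3f) end=(%.3f, %.3f)%n",
                start[0], start[1], end[0], end[1]);
    }
}
```

For the first line this should give a starting point of roughly (19.96, 538.975), which is the FindPdfLines value up to rounding.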

That being said, your code will work for very special documents only, because

1) You assume the code for lines always to be that identical sequence of operations, with differences only in the cm and l operands. In general there can be other operations in between (e.g. operations setting the color or rendering mode, or even whole text blocks). Furthermore, the operands of the m operator need not be 0 0. And, of course, some of your operators are not required, e.g. there need not be q, Q, or cm operators at all.

2) You process cm and Q only if they are preceded by operations matching your assumed sequence of operators for a line. Thus, you only process some of the concatenated transformation matrices, and you only undo (restore state) some of the transformation matrix changes.

3) By applying Math.Abs to the l operands, you throw away the information whether the line goes left or right from the starting point, and whether it goes up or down.

Thus, your code may serve as a proof of concept but not for general use.

Regards, Michael

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Detect-Lines-in-PDF-tp4660295p4660349.html
Sent from the iText - General mailing list archive at Nabble.com.
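
P.S. As a rough illustration of points 2) and 3), here is how the graphics state could be tracked independently of any assumed operator sequence. Please take it as a sketch under assumptions, not as iText code: LineTracker and its handle method are hypothetical names, the operators are fed in by hand instead of coming from a real content stream parser, and a real implementation would only report a line once a painting operator such as S is seen.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch; not an iText class. Matrices are [a b c d e f]. */
final class LineTracker {
    private double[] ctm = { 1, 0, 0, 1, 0, 0 };               // current transformation matrix
    private final Deque<double[]> saved = new ArrayDeque<>();   // stack for q / Q
    private double[] current;                                   // current point, null before the first m
    final List<double[]> lines = new ArrayList<>();             // each entry: {x0, y0, x1, y1}

    /** Handles one operator with its numeric operands; unknown operators are ignored. */
    void handle(String op, double... args) {
        switch (op) {
            case "q":                       // save graphics state, always
                saved.push(ctm.clone());
                break;
            case "Q":                       // restore graphics state, always
                if (!saved.isEmpty()) ctm = saved.pop();
                break;
            case "cm":                      // concatenate: new CTM = operand matrix x old CTM
                ctm = multiply(args, ctm);
                break;
            case "m":                       // operands need not be 0 0
                current = transform(args[0], args[1]);
                break;
            case "l":                       // signs are kept, no Math.abs
                if (current != null) {
                    double[] end = transform(args[0], args[1]);
                    lines.add(new double[] { current[0], current[1], end[0], end[1] });
                    current = end;
                }
                break;
            default:
                break;
        }
    }

    private double[] transform(double x, double y) {
        return new double[] {
            ctm[0] * x + ctm[2] * y + ctm[4],
            ctm[1] * x + ctm[3] * y + ctm[5]
        };
    }

    private static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],        m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],        m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4], m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }
}
```

Feeding it an assumed sequence like q, 1 0 0 1 184.01 538.4 cm, 0 0 m, 0 -509.96 l, Q (the negative sign of the l operand is my guess) yields a line from (184.01, 538.4) down to about (184.01, 28.44), so the direction is preserved, and a line drawn after the Q is no longer affected by the cm.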