Re: [iText-questions] How do I extract the coordinate of the words from a pdf document?

Kausik Porel Mon, 14 Jan 2013 06:47:14 -0800

Dear Michael,
Sorry for not providing the code in my previous mail.

I have tried the following code to extract the coordinates of the
words, but it mainly gives the position of a line, not of each word. Can
you please look at the code and suggest a fix? The code is attached to this
mail. It is a copy of LocationTextExtractionStrategy with some code added
for my requirements.


Regards,
Kausik Porel

On Fri, Jan 11, 2013 at 6:39 PM, mkl <m...@wir-sind-cool.org> wrote:

> Kausik,
>
> Kausik Porel wrote
> > But some text blocks contain multiple words without a space, and in that
> > case it is not able to extract the words correctly. I'm filtering on the
> > basis of the position of the text returned by TextRenderInfo.
> > For example, suppose the words are "hello world"; what my custom
> > listener extracts from TextRenderInfo is as follows: he+ll+ow+rld
> >
> > In this case it is not possible to determine the word separation.
> > Can you help me with this?
>
> Unfortunately you did not supply the code of your custom listener. Thus, it
> is hard to say what exactly you are doing wrong.
>
> Most likely you do not check the distance between one TextRenderInfo and
> the next one in the same line. If the distance is very small (which it most
> likely is at the separations you indicated), the texts of those
> TextRenderInfos belong together and are separate in the PDF only for
> kerning. If the distance is big, you most likely have the end of one and
> the start of another word.
>
> Kausik Porel wrote
> > Can you provide any code snippet?
>
> You can find some code for inspiration in the iText sources (open source
> after all...). For very orderly content streams have a look at the
> SimpleTextExtractionStrategy and for the generic case at the
> LocationTextExtractionStrategy.
>
> Regards,   Michael
>
>
>
> --
> View this message in context:
> http://itext-general.2136553.n4.nabble.com/How-do-I-extract-the-coordinate-of-the-words-from-a-pdf-document-tp4657306p4657345.html
> Sent from the iText - General mailing list archive at Nabble.com.
>
>
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> iText(R) is a registered trademark of 1T3XT BVBA.
> Many questions posted to this list can (and will) be answered with a
> reference to the iText book: http://www.itextpdf.com/book/
> Please check the keywords list before you ask for examples:
> http://itextpdf.com/themes/keywords.php
>
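The distance check Michael describes in the quoted reply can be sketched in isolation as follows. This is only a minimal illustration, assuming the fragments of one line are already sorted left to right. Fragment and WordGrouper are hypothetical names, not iTextSharp types; in real code the start, end, and space width would come from TextRenderInfo, as in the attached strategy. The half-space threshold is taken from the attached code and is an assumption, not a fixed rule.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for one TextRenderInfo: its text, the x positions of
// its start and end on a shared baseline, and the width of a space in its font.
public sealed class Fragment
{
    public string Text;
    public float Start;
    public float End;
    public float SpaceWidth;

    public Fragment(string text, float start, float end, float spaceWidth)
    {
        Text = text; Start = start; End = end; SpaceWidth = spaceWidth;
    }
}

public static class WordGrouper
{
    // Joins the fragments of one line into words. A gap wider than half a
    // space starts a new word; a smaller gap is treated as kerning inside
    // the same word.
    public static List<string> Group(IList<Fragment> line)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        Fragment previous = null;

        foreach (Fragment f in line)
        {
            if (previous != null)
            {
                float gap = f.Start - previous.End;
                if (gap > f.SpaceWidth / 2.0f && current.Length > 0)
                {
                    words.Add(current.ToString());
                    current.Clear();
                }
            }
            current.Append(f.Text);
            previous = f;
        }

        if (current.Length > 0) words.Add(current.ToString());
        return words;
    }
}
```

With fragments "he", "ll", "o", "wo", "rld", where only the gap between "o" and "wo" exceeds half a space width, Group returns the two words "hello" and "world".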

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;

namespace PdfScan
{
    public class TextStrategy : ITextExtractionStrategy
    {

        /** set to true for debugging */
        public static bool DUMP_STATE = false;

        /** a summary of all found text */
        private List<TextChunk> locationalResult = new List<TextChunk>();

        /**
         * Creates a new text extraction renderer.
         */
        public TextStrategy()
        {
        }

        /**
         * @see com.itextpdf.text.pdf.parser.RenderListener#beginTextBlock()
         */
        public virtual void BeginTextBlock()
        {
        }

        /**
         * @see com.itextpdf.text.pdf.parser.RenderListener#endTextBlock()
         */
        public virtual void EndTextBlock()
        {
        }

        /**
         * @param str
         * @return true if the string starts with a space character, false if
         *         the string is empty or starts with a non-space character
         */
        private bool StartsWithSpace(String str)
        {
            if (str.Length == 0) return false;
            return str[0] == ' ';
        }

        /**
         * @param str
         * @return true if the string ends with a space character, false if
         *         the string is empty or ends with a non-space character
         */
        private bool EndsWithSpace(String str)
        {
            if (str.Length == 0) return false;
            return str[str.Length - 1] == ' ';
        }

        /**
         * Returns the result so far.
         * @return  a String with one JSON-style record per word (text plus bounding box), one record per line.
         */
        public virtual String GetResultantText()
        {
            if (DUMP_STATE) DumpState();

            locationalResult.Sort();

            StringBuilder sb = new StringBuilder();
            string word = "";   // text of the word being assembled
            float left = 0, bottom = 0, right = 0, top = 0;   // its bounding box
            TextChunk lastChunk = null;

            foreach (TextChunk chunk in locationalResult)
            {
                bool sameLine = lastChunk != null && chunk.SameLine(lastChunk);
                bool newWord = !sameLine;   // the first chunk and every new line start a word
                if (sameLine)
                {
                    float dist = chunk.DistanceFromEndOf(lastChunk);
                    // a chunk that jumps backwards, or a gap wider than half a space with no
                    // space character on either side of it, ends the current word
                    newWord = dist < -chunk.charSpaceWidth
                        || (dist > chunk.charSpaceWidth / 2.0f
                            && !StartsWithSpace(chunk.text) && !EndsWithSpace(lastChunk.text));
                }

                if (newWord)
                {
                    if (lastChunk != null)
                    {
                        AppendWord(sb, word, left, bottom, right, top);
                        if (!sameLine) sb.Append('\n');   // new line: keep the blank separator line
                    }
                    // start a new word with this chunk's own coordinates
                    word = chunk.text;
                    left = chunk.startLocation[Vector.I1];
                    bottom = chunk.startLocation[Vector.I2];
                    right = chunk.endLocation[Vector.I1];
                    top = chunk.topLocation[Vector.I2];
                }
                else
                {
                    // no separating gap, so the chunk continues the current word
                    // NOTE: spaces inside a chunk's own text (or at a chunk boundary) do not
                    // split the word here, so a chunk holding several words is reported as one entry
                    word += chunk.text;
                    right = chunk.endLocation[Vector.I1];
                    top = Math.Max(top, chunk.topLocation[Vector.I2]);
                }
                lastChunk = chunk;
            }

            // the last word is still pending when the loop ends
            if (lastChunk != null) AppendWord(sb, word, left, bottom, right, top);

            return sb.ToString();
        }

        /** Appends one word record: the cleaned text followed by its bounding box. */
        private static void AppendWord(StringBuilder sb, string word, float left, float bottom, float right, float top)
        {
            string value = word.Replace(",", "").Replace(":", "").Replace("-", "").Replace(".", "").Replace("}", "").Trim();
            sb.Append("{\"Value\":\"" + value
                + "\",\"Left\":\"" + left
                + "\",\"Top\":\"" + top
                + "\",\"Width\":\"" + (right - left)
                + "\",\"Height\":\"" + (top - bottom)
                + "\",\"Right\":\"" + right
                + "\"}\n");
        }

        /** Used for debugging only */
        private void DumpState()
        {
            foreach (TextChunk location in locationalResult)
            {

                location.PrintDiagnostics();

                Console.WriteLine();
            }

        }

        /**
         *
         * @see com.itextpdf.text.pdf.parser.RenderListener#renderText(com.itextpdf.text.pdf.parser.TextRenderInfo)
         */
        public virtual void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
            if (renderInfo.GetRise() != 0)
            {
                // remove the rise from the baseline - we do this because the text from a
                // super/subscript render operation should probably be considered as part
                // of the baseline of the text the super/sub is relative to
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                segment = segment.TransformBy(riseOffsetTransform);
            }

            TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(),
                segment.GetEndPoint(), topRight, renderInfo.GetSingleSpaceWidth());
            locationalResult.Add(location);
        }



        /**
         * Represents a chunk of text, its orientation, and location relative
         * to the orientation vector
         */
        private class TextChunk : IComparable<TextChunk>
        {
            /** the text of the chunk */
            internal String text;
            /** the starting location of the chunk */
            internal Vector startLocation;
            /** the ending location of the chunk */
            internal Vector endLocation;
            /** the top location of the chunk */
            internal Vector topLocation;
            /** unit vector in the orientation of the chunk */
            internal Vector orientationVector;
            /** the orientation as a scalar for quick sorting */
            internal int orientationMagnitude;
            /** perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system);
             * we round to the nearest integer to handle the fuzziness of comparing floats */
            internal int distPerpendicular;
            /** distance of the start of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) */
            internal float distParallelStart;
            /** distance of the end of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) */
            internal float distParallelEnd;
            /** the width of a single space character in the font of the chunk */
            internal float charSpaceWidth;

            public TextChunk(String str, Vector startLocation, Vector endLocation, Vector topLocation, float charSpaceWidth)
            {
                this.text = str;
                this.startLocation = startLocation;
                this.endLocation = endLocation;
                this.charSpaceWidth = charSpaceWidth;

                this.topLocation = topLocation;

                Vector oVector = endLocation.Subtract(startLocation);
                if (oVector.Length == 0)
                {
                    oVector = new Vector(1, 0, 0);
                }
                orientationVector = oVector.Normalize();
                orientationMagnitude = (int)(Math.Atan2(orientationVector[Vector.I2], orientationVector[Vector.I1]) * 1000);

                // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
                // the two vectors we are crossing are in the same plane, so the result will be purely
                // in the z-axis (out of plane) direction, so we just take the I3 component of the result
                Vector origin = new Vector(0, 0, 1);
                distPerpendicular = (int)(startLocation.Subtract(origin)).Cross(orientationVector)[Vector.I3];

                distParallelStart = orientationVector.Dot(startLocation);
                distParallelEnd = orientationVector.Dot(endLocation);
            }

            public void PrintDiagnostics()
            {
                Console.WriteLine("Text (@" + startLocation + " -> " + endLocation + "): " + text);
                Console.WriteLine("orientationMagnitude: " + orientationMagnitude);
                Console.WriteLine("distPerpendicular: " + distPerpendicular);
                Console.WriteLine("distParallel: " + distParallelStart);
            }

            /**
             * @param a the location to compare to
             * @return true if this location is on the same line as the other
             */
            public bool SameLine(TextChunk a)
            {
                if (orientationMagnitude != a.orientationMagnitude) return false;
                if (distPerpendicular != a.distPerpendicular) return false;
                return true;
            }

            /**
             * Computes the distance between the end of 'other' and the beginning of this chunk
             * in the direction of this chunk's orientation vector.  Note that it's a bad idea
             * to call this for chunks that aren't on the same line and orientation, but we don't
             * explicitly check for that condition for performance reasons.
             * @param other
             * @return the distance between the end of 'other' and the beginning of this chunk
             */
            public float DistanceFromEndOf(TextChunk other)
            {
                float distance = distParallelStart - other.distParallelEnd;
                return distance;
            }

            /**
             * Compares based on orientation, perpendicular distance, then parallel distance
             * @see java.lang.Comparable#compareTo(java.lang.Object)
             */
            public int CompareTo(TextChunk rhs)
            {
                if (this == rhs) return 0; // not really needed, but just in case

                int rslt;
                rslt = CompareInts(orientationMagnitude, rhs.orientationMagnitude);
                if (rslt != 0) return rslt;

                rslt = CompareInts(distPerpendicular, rhs.distPerpendicular);
                if (rslt != 0) return rslt;

                // note: it's never safe to check floating point numbers for equality, and if two chunks
                // are truly right on top of each other, which one comes first or second just doesn't matter
                // so we arbitrarily choose this way.
                rslt = distParallelStart < rhs.distParallelStart ? -1 : 1;

                return rslt;
            }

            /**
             *
             * @param int1
             * @param int2
             * @return comparison of the two integers
             */
            private static int CompareInts(int int1, int int2)
            {
                return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
            }


        }

        /**
         * no-op method - this renderer isn't interested in image events
         * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(com.itextpdf.text.pdf.parser.ImageRenderInfo)
         * @since 5.0.1
         */
        public void RenderImage(ImageRenderInfo renderInfo)
        {
            // do nothing
        }
    }

}

