Dear Michael,
Sorry for not including the code in my previous mail.
I have tried to extract the coordinates of the words with the following
code, but it mostly gives the position of a whole line rather than of each
word. Could you please look at the code and suggest what to change? The code
is attached to this mail. It is a copy of LocationTextExtractionStrategy with
some code added for my requirements.
Regards,
Kausik Porel
On Fri, Jan 11, 2013 at 6:39 PM, mkl <m...@wir-sind-cool.org> wrote:
> Kausik,
>
> Kausik Porel wrote
> > But some text block contains multiple words without space and at that time
> > it is not able to extract words correctly. I'm filtering it on the basis
> > of position of the text return by TextRenderInfo.
> > For example. suppose there are words : "hello world", when my custom
> > listener extract in TextRenderInfo is as follows he+ll+ow+rld
> >
> > In this case it is not possible to understand the word separation.
> > Can you help me on this.
>
> Unfortunately you did not supply the code of your custom listener. Thus, it
> is hard to say what exactly you are doing wrong.
>
> Most likely you do not check the distance between one TextRenderInfo and the
> next one in the same line --- if the distance is very small (which it most
> likely is at the separations you indicated), the texts of those
> TextRenderInfos belong together and are separate in the PDF only for
> kerning. If the distance is big, you most likely have the end of one and the
> start of another word.
>
> Kausik Porel wrote
> > Can you provide any code snippet?
>
> You can find some code for inspiration in the iText sources (open source
> after all...). For very orderly content streams have a look at the
> SimpleTextExtractionStrategy and for the generic case at the
> LocationTextExtractionStrategy.
>
> Regards, Michael
>
>
>
> --
> View this message in context:
> http://itext-general.2136553.n4.nabble.com/How-do-I-extract-the-coordinate-of-the-words-from-a-pdf-document-tp4657306p4657345.html
> Sent from the iText - General mailing list archive at Nabble.com.
>
>
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> iText(R) is a registered trademark of 1T3XT BVBA.
> Many questions posted to this list can (and will) be answered with a
> reference to the iText book: http://www.itextpdf.com/book/
> Please check the keywords list before you ask for examples:
> http://itextpdf.com/themes/keywords.php
>
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;

namespace PdfScan
{
    public class TextStrategy : ITextExtractionStrategy
    {
        /** set to true for debugging */
        public static bool DUMP_STATE = false;

        /** a summary of all found text */
        private List<TextChunk> locationalResult = new List<TextChunk>();

        /**
         * Creates a new text extraction renderer.
         */
        public TextStrategy()
        {
        }

        /**
         * @see com.itextpdf.text.pdf.parser.RenderListener#beginTextBlock()
         */
        public virtual void BeginTextBlock()
        {
        }

        /**
         * @see com.itextpdf.text.pdf.parser.RenderListener#endTextBlock()
         */
        public virtual void EndTextBlock()
        {
        }

        /**
         * @param str
         * @return true if the string starts with a space character, false if
         * the string is empty or starts with a non-space character
         */
        private bool StartsWithSpace(String str)
        {
            if (str.Length == 0) return false;
            return str[0] == ' ';
        }

        /**
         * @param str
         * @return true if the string ends with a space character, false if the
         * string is empty or ends with a non-space character
         */
        private bool EndsWithSpace(String str)
        {
            if (str.Length == 0) return false;
            return str[str.Length - 1] == ' ';
        }
        /**
         * Returns the result so far.
         * @return a String with the resulting text.
         */
        public virtual String GetResultantText()
        {
            string word = "";
            float lastLeft = 0;
            float lastWidth = 0;
            float lastTop = 0;
            float lastHeight = 0;
            float lastRight = 0;
            if (DUMP_STATE) DumpState();
            locationalResult.Sort();
            StringBuilder sb = new StringBuilder();
            TextChunk lastChunk = null;
            foreach (TextChunk chunk in locationalResult)
            {
                if (lastChunk == null)
                {
                    word = chunk.text; // stores the word
                    //sb.Append(chunk.text);
                    // create a rectangle to get the coordinates
                    iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(
                        chunk.startLocation[Vector.I1], chunk.startLocation[Vector.I2],
                        chunk.endLocation[Vector.I1], chunk.topLocation[Vector.I2]);
                    lastLeft = rect.Left;
                    lastRight = rect.Right;
                    lastTop = rect.Top;
                    lastHeight = rect.Height;
                    lastWidth = rect.Width;
                    //sb.Append(rect.Left.ToString() + "," + rect.Top.ToString() + "," + rect.Width.ToString() + "," + rect.Height.ToString());
                }
                else
                {
                    // if we get a space, that means a new word
                    if (chunk.SameLine(lastChunk))
                    {
                        float dist = chunk.DistanceFromEndOf(lastChunk);
                        if (dist < -chunk.charSpaceWidth)
                        {
                            //sb.Append(' ');
                            //sb.Append(word);// + " ");
                            sb.Append("{\"Value\":\"" + word.Replace(",", "").Replace(":", "").Replace("-", "").Replace(".", "").Replace("}", "").Trim()
                                + "\",\"Left\":\"" + lastLeft + "\",\"Top\":\"" + lastTop + "\",\"Width\":\"" + lastWidth
                                + "\",\"Height\":\"" + lastHeight + "\",\"Right\":\"" + lastRight + "\"}\n"); //+ " ");
                            word = "";
                            lastLeft = 0;
                            lastWidth = 0;
                            lastTop = 0;
                            lastHeight = 0;
                            lastRight = 0;
                        }
                        // we only insert a blank space if the trailing character of the previous string
                        // wasn't a space, and the leading character of the current string isn't a space
                        else if (dist > chunk.charSpaceWidth / 2.0f &&
                            !StartsWithSpace(chunk.text) && !EndsWithSpace(lastChunk.text))
                        {
                            //sb.Append(' ');
                            sb.Append("{\"Value\":\"" + word.Replace(",", "").Replace(":", "").Replace("-", "").Replace(".", "").Replace("}", "").Trim()
                                + "\",\"Left\":\"" + lastLeft + "\",\"Top\":\"" + lastTop + "\",\"Width\":\"" + lastWidth
                                + "\",\"Height\":\"" + lastHeight + "\",\"Right\":\"" + lastRight + "\"}\n"); //+ " ");
                            word = "";
                            lastLeft = 0;
                            lastWidth = 0;
                            lastTop = 0;
                            lastHeight = 0;
                            lastRight = 0;
                        }
                        word += chunk.text; // if no space then it is the same word
                        //sb.Append(chunk.text);
                        iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(
                            chunk.startLocation[Vector.I1], chunk.startLocation[Vector.I2],
                            chunk.endLocation[Vector.I1], chunk.topLocation[Vector.I2]);
                        lastWidth += rect.Width; // increase the width
                    }
                    else
                    {
                        // new line, start from the beginning
                        sb.Append("{\"Value\":\"" + word.Replace(",", "").Replace(":", "").Replace("-", "").Replace(".", "").Replace("}", "").Trim()
                            + "\",\"Left\":\"" + lastLeft + "\",\"Top\":\"" + lastTop + "\",\"Width\":\"" + lastWidth
                            + "\",\"Height\":\"" + lastHeight + "\",\"Right\":\"" + lastRight + "\"}\n"); //+ " ");
                        word = "";
                        word = chunk.text;
                        sb.Append('\n');
                        //sb.Append(chunk.text);
                        iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(
                            chunk.startLocation[Vector.I1], chunk.startLocation[Vector.I2],
                            chunk.topLocation[Vector.I1], chunk.topLocation[Vector.I2]);
                        lastLeft = rect.Left;
                        lastRight = rect.Right;
                        lastTop = rect.Top;
                        lastHeight = rect.Height;
                        lastWidth = rect.Width;
                    }
                }
                lastChunk = chunk;
            }
            return sb.ToString();
        }
        /** Used for debugging only */
        private void DumpState()
        {
            foreach (TextChunk location in locationalResult)
            {
                location.PrintDiagnostics();
                Console.WriteLine();
            }
        }

        /**
         * @see com.itextpdf.text.pdf.parser.RenderListener#renderText(com.itextpdf.text.pdf.parser.TextRenderInfo)
         */
        public virtual void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
            if (renderInfo.GetRise() != 0)
            {
                // remove the rise from the baseline - we do this because the text from a
                // super/subscript render operations should probably be considered as part
                // of the baseline of the text the super/sub is relative to
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                segment = segment.TransformBy(riseOffsetTransform);
            }
            TextChunk location = new TextChunk(renderInfo.GetText(),
                segment.GetStartPoint(), segment.GetEndPoint(), topRight,
                renderInfo.GetSingleSpaceWidth());
            locationalResult.Add(location);
        }
        /**
         * Represents a chunk of text, its orientation, and location relative
         * to the orientation vector
         */
        private class TextChunk : IComparable<TextChunk>
        {
            /** the text of the chunk */
            internal String text;
            /** the starting location of the chunk */
            internal Vector startLocation;
            /** the ending location of the chunk */
            internal Vector endLocation;
            /** the top location of the chunk */
            internal Vector topLocation;
            /** unit vector in the orientation of the chunk */
            internal Vector orientationVector;
            /** the orientation as a scalar for quick sorting */
            internal int orientationMagnitude;
            /** perpendicular distance to the orientation unit vector (i.e. the
             * Y position in an unrotated coordinate system)
             * we round to the nearest integer to handle the fuzziness of
             * comparing floats */
            internal int distPerpendicular;
            /** distance of the start of the chunk parallel to the orientation
             * unit vector (i.e. the X position in an unrotated coordinate system) */
            internal float distParallelStart;
            /** distance of the end of the chunk parallel to the orientation
             * unit vector (i.e. the X position in an unrotated coordinate system) */
            internal float distParallelEnd;
            /** the width of a single space character in the font of the chunk */
            internal float charSpaceWidth;

            public TextChunk(String str, Vector startLocation, Vector endLocation,
                Vector topLocation, float charSpaceWidth)
            {
                this.text = str;
                this.startLocation = startLocation;
                this.endLocation = endLocation;
                this.charSpaceWidth = charSpaceWidth;
                this.topLocation = topLocation;
                Vector oVector = endLocation.Subtract(startLocation);
                if (oVector.Length == 0)
                {
                    oVector = new Vector(1, 0, 0);
                }
                orientationVector = oVector.Normalize();
                orientationMagnitude =
                    (int)(Math.Atan2(orientationVector[Vector.I2], orientationVector[Vector.I1]) * 1000);
                // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
                // the two vectors we are crossing are in the same plane, so the result will be purely
                // in the z-axis (out of plane) direction, so we just take the I3 component of the result
                Vector origin = new Vector(0, 0, 1);
                distPerpendicular =
                    (int)(startLocation.Subtract(origin)).Cross(orientationVector)[Vector.I3];
                distParallelStart = orientationVector.Dot(startLocation);
                distParallelEnd = orientationVector.Dot(endLocation);
            }
            public void PrintDiagnostics()
            {
                Console.WriteLine("Text (@" + startLocation + " -> " + endLocation + "): " + text);
                Console.WriteLine("orientationMagnitude: " + orientationMagnitude);
                Console.WriteLine("distPerpendicular: " + distPerpendicular);
                Console.WriteLine("distParallel: " + distParallelStart);
            }

            /**
             * @param a the location to compare to
             * @return true if this location is on the same line as the other
             */
            public bool SameLine(TextChunk a)
            {
                if (orientationMagnitude != a.orientationMagnitude) return false;
                if (distPerpendicular != a.distPerpendicular) return false;
                return true;
            }

            /**
             * Computes the distance between the end of 'other' and the beginning of this chunk
             * in the direction of this chunk's orientation vector. Note that it's a bad idea
             * to call this for chunks that aren't on the same line and orientation, but we don't
             * explicitly check for that condition for performance reasons.
             * @param other
             * @return the number of spaces between the end of 'other' and the
             * beginning of this chunk
             */
            public float DistanceFromEndOf(TextChunk other)
            {
                float distance = distParallelStart - other.distParallelEnd;
                return distance;
            }

            /**
             * Compares based on orientation, perpendicular distance, then
             * parallel distance
             * @see java.lang.Comparable#compareTo(java.lang.Object)
             */
            public int CompareTo(TextChunk rhs)
            {
                if (this == rhs) return 0; // not really needed, but just in case
                int rslt;
                rslt = CompareInts(orientationMagnitude, rhs.orientationMagnitude);
                if (rslt != 0) return rslt;
                rslt = CompareInts(distPerpendicular, rhs.distPerpendicular);
                if (rslt != 0) return rslt;
                // note: it's never safe to check floating point numbers for equality, and if two chunks
                // are truly right on top of each other, which one comes first or second just doesn't matter
                // so we arbitrarily choose this way.
                rslt = distParallelStart < rhs.distParallelStart ? -1 : 1;
                return rslt;
            }

            /**
             * @param int1
             * @param int2
             * @return comparison of the two integers
             */
            private static int CompareInts(int int1, int int2)
            {
                return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
            }
        }
        /**
         * no-op method - this renderer isn't interested in image events
         * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(com.itextpdf.text.pdf.parser.ImageRenderInfo)
         * @since 5.0.1
         */
        public void RenderImage(ImageRenderInfo renderInfo)
        {
            // do nothing
        }
    }
}
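
For comparison, the gap check suggested in the quoted reply (small distance: same word, large distance: new word) can be sketched without any iText types. This is only an illustration under stated assumptions, not the library's algorithm: Piece, Word and WordGrouper are hypothetical names, the text is assumed to be horizontal, the pieces are assumed to be sorted by line and then by X, and each piece is assumed to hold no embedded space (for example one piece per character, which TextRenderInfo.GetCharacterRenderInfos() may provide if your iTextSharp version has it). The half-space-width threshold is the one used in the attached code. The sketch keeps one bounding box per word and grows it with each piece, and it also emits the last word after the loop.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for TextChunk, reduced to unrotated text.
public sealed class Piece
{
    public string Text;
    public float Left, Right;   // baseline start and end X
    public float Bottom, Top;   // baseline Y and ascent Y
    public float SpaceWidth;    // width of one space in the piece's font
    public int Line;            // any line key, e.g. the rounded baseline Y
}

// Hypothetical result type: one word with its bounding box.
public sealed class Word
{
    public string Text;
    public float Left, Right, Bottom, Top;
    public float Width { get { return Right - Left; } }
    public float Height { get { return Top - Bottom; } }
}

public static class WordGrouper
{
    // Assumes the pieces are already sorted by Line, then by Left.
    public static List<Word> Group(IList<Piece> pieces)
    {
        var words = new List<Word>();
        Word current = null;
        Piece last = null;

        foreach (Piece p in pieces)
        {
            bool blank = p.Text.Trim().Length == 0;
            // A new line, or a gap wider than half a space, ends the word.
            bool gap = last != null
                && (p.Line != last.Line || p.Left - last.Right > p.SpaceWidth / 2.0f);

            if (current != null && (blank || gap))
            {
                words.Add(current);
                current = null;
            }

            if (!blank)
            {
                if (current == null)
                {
                    current = new Word
                    {
                        Text = p.Text,
                        Left = p.Left, Right = p.Right,
                        Bottom = p.Bottom, Top = p.Top
                    };
                }
                else
                {
                    // Same word: append the text and grow the box.
                    current.Text += p.Text;
                    current.Right = Math.Max(current.Right, p.Right);
                    current.Top = Math.Max(current.Top, p.Top);
                    current.Bottom = Math.Min(current.Bottom, p.Bottom);
                }
            }
            last = p;
        }

        if (current != null) words.Add(current); // do not lose the final word
        return words;
    }
}
```

Feeding it the pieces "he", "ll", "o", "w", "orld", where only the gap between "o" and "w" exceeds half a space width, should give two words, each with its own Left, Right, Width and Height.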