Re: [iText-questions] Re: problem: reading text inside pages of a pdf file

Christian Oetterli Wed, 08 Dec 2004 15:30:15 -0800

Hi

I had to solve this problem myself recently. I found a piece of C code (http://www.codeproject.com/cpp/ExtractPDFText.asp, thanks to the author Shar136) and adapted it to Java.

It works very well for me. I use it to index pages of large PDFs (>200 pages).

I hope this is of help to you.

Sincerely
        Christian

Here is the code:

package textextractor;

import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import com.lowagie.text.pdf.PdfReader;

/**
 * Extracts text out of a pdf page. See main() for an example of usage.
 * Thanks to Shar136 for his code to extract plain text from a PDF file
 * (http://www.codeproject.com/cpp/ExtractPDFText.asp)
 *
 * @author krizz
 */
public class PdfTextExtractor {

  protected final static int N_OLD_CHAR = 15;

  protected static boolean seen2(char search[], char recent[]) {
    return (recent[N_OLD_CHAR - 3] == search[0]
      && recent[N_OLD_CHAR - 2] == search[1]
      && (recent[N_OLD_CHAR - 1] == ' ' || recent[N_OLD_CHAR - 1] == 0x0d || recent[N_OLD_CHAR - 1] == 0x0a)
      && (recent[N_OLD_CHAR - 4] == ' ' || recent[N_OLD_CHAR - 4] == 0x0d || recent[N_OLD_CHAR - 4] == 0x0a));
  }

  protected static float extractNumber(char search[], int lastcharoffset) {
    int i = lastcharoffset;
    while (i > 0 && search[i] == ' ') {
      i--;
    }
    while (i > 0 && (Character.isDigit(search[i]) || search[i] == '.')) {
      i--;
    }
    float flt = -1.0f;
    char buffer[] = new char[N_OLD_CHAR + 5];
    System.arraycopy(search, i + 1, buffer, 0, lastcharoffset - i);
    //    strncpy(buffer, search + i + 1, lastcharoffset - i);
    if (buffer[0] > 0) {
      try {
        flt = Float.parseFloat(String.valueOf(buffer));
      } catch (Exception e) {
        // ignore
      }
    }
    return flt;
  }

  /**
   * This method processes an uncompressed Adobe (text) object and extracts text.
   *
   * @param bytes the uncompressed content of the object (e.g. a page)
   * @param result the buffer to which the extracted text is appended
   */
  public static void extractText(byte bytes[], StringBuffer result) {

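    try {
      // Decode as ISO-8859-1 so that every byte maps to exactly one char.
      // The platform default encoding (e.g. UTF-8) can mangle bytes >= 128,
      // which the range check further down expects to see unchanged.
      InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(bytes), "ISO-8859-1");
      int nRead;
      char buf[] = new char[2048];
      StringBuffer content = new StringBuffer();
      while ((nRead = reader.read(buf)) != -1) {
        content.append(buf, 0, nRead);
      }
      reader.close();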

      final char[] bt = new char[] { 'B', 'T' };
      final char[] et = new char[] { 'E', 'T' };
      final char[] td = new char[] { 'T', 'D' };

      //Are we currently inside a text object?
      boolean intextobject = false;

      //Is the next character literal (e.g. \\ to get a \ character or \( to get ( ):
      boolean nextliteral = false;

      //() Bracket nesting level. Text appears inside ()
      int rbdepth = 0;

      //Keep previous chars to get extract numbers etc.:
      char oc[] = new char[N_OLD_CHAR];
      int j = 0;
      for (j = 0; j < N_OLD_CHAR; j++) {
        oc[j] = ' ';
      }

      int len = content.length();
      for (int i = 0; i < len; i++) {
        char c = content.charAt(i);
        if (intextobject) {
          if (rbdepth == 0 && seen2(td, oc)) {
            //Positioning.
            //See if a new line has to start or just a tab:
            float num = extractNumber(oc, N_OLD_CHAR - 5);
            if (num > 1.0) {
              result.append('\n');
            }
            if (num < 1.0) {
              result.append('\t');
            }
          }
          if (rbdepth == 0 && seen2(et, oc)) {
            //End of a text object, also go to a new line.
            intextobject = false;
            result.append('\n');
            //          fputc(0x0d, file);
            //          fputc(0x0a, file);
          } else if (c == '(' && rbdepth == 0 && !nextliteral) {
            //Start outputting text!
            rbdepth = 1;
            //See if a space or tab (>1000) is called for by looking
            //at the number in front of (
            float num = extractNumber(oc, N_OLD_CHAR - 1);
            if (num > 0) {
              if (num > 1000.0) {
                result.append('\t');
              } else if (num > 100.0) {
                result.append(' ');
              }
            }
          } else if (c == ')' && rbdepth == 1 && !nextliteral) {
            //Stop outputting text
            rbdepth = 0;
          } else if (rbdepth == 1) {
            //Just a normal text character:
            if (c == '\\' && !nextliteral) {
              //Only print out next character no matter what. Do not interpret.
              nextliteral = true;
            } else {
              nextliteral = false;
              if (((c >= ' ') && (c <= '~')) || ((c >= 128) && (c < 255))) {
                result.append(c);
                //            fputc(c, file);
              }
            }
          }
        }

        //Store the recent characters for when we have to go back for a number:
        for (j = 0; j < N_OLD_CHAR - 1; j++) {
          oc[j] = oc[j + 1];
        }

        oc[N_OLD_CHAR - 1] = c;

        if (!intextobject) {
          if (seen2(bt, oc)) {
            //Start of a text object:
            intextobject = true;
          }
        }
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    try {
      if (args.length == 0) {
        throw new RuntimeException("usage: " + PdfTextExtractor.class.getName() + " <pdf-file>");
      }

      String pdf = args[0];

      PdfReader reader = new PdfReader(new FileInputStream(pdf));

      for (int i = 0, n = reader.getNumberOfPages(); i < n; i++) {
        int page = i + 1;
        byte[] content = reader.getPageContent(page);
        StringBuffer text = new StringBuffer();
        extractText(content, text);
        System.out.println("text for page " + page + ":\n" + text);
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
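
To make it easier to see what the scanner above is looking for, here is a minimal, self-contained sketch of the same idea. It is not the class posted above: the names ContentStreamSketch and extractLiterals and the sample content stream are made up for illustration. It only collects the (...) string literals that appear between the BT and ET operators, and it deliberately ignores nested parentheses, octal escapes, hex strings, positioning and font encodings.

```java
public class ContentStreamSketch {

  /** Returns the text of all (...) literals found between BT and ET. */
  public static String extractLiterals(String stream) {
    StringBuilder out = new StringBuilder();
    StringBuilder token = new StringBuilder(); // current operator or operand
    boolean inText = false;   // are we between BT and ET?
    boolean inString = false; // are we inside ( ... )?
    boolean escaped = false;  // was the previous char a backslash?

    for (int i = 0; i < stream.length(); i++) {
      char c = stream.charAt(i);
      if (inString) {
        if (escaped) {
          // Copy the escaped char as-is, like the posted code does.
          out.append(c);
          escaped = false;
        } else if (c == '\\') {
          escaped = true;
        } else if (c == ')') {
          inString = false;
        } else {
          out.append(c);
        }
        continue;
      }
      if (Character.isWhitespace(c)) {
        // A token just ended: check whether it was BT or ET.
        String t = token.toString();
        token.setLength(0);
        if (t.equals("BT")) {
          inText = true;
        } else if (t.equals("ET")) {
          inText = false;
          out.append('\n'); // end of text object: start a new line
        }
      } else if (c == '(' && inText) {
        inString = true;
        token.setLength(0);
      } else {
        token.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Hand-written sample; real page content comes from the PDF itself.
    String stream = "BT /F1 12 Tf 72 712 Td (Hello \\(PDF\\) world) Tj ET\n";
    System.out.print(extractLiterals(stream)); // prints "Hello (PDF) world"
  }
}
```

The posted class does the same thing with a 15-character window of recent characters instead of a token buffer, which is what lets it look back for the number in front of TD or "(".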


Bruno wrote:

Quoting Cocchi Maurizio <[EMAIL PROTECTED]>:

Dear Mr. Lowagie,
I am really sorry to disturb you, but I have a big problem and I cannot find
anyone to help me.
I need to read the text inside a big pdf file (260 pages).
The problem is to loop over the pages and read the text inside every page.
Can you help me, or do you know someone who may be able to help me?

All depends on what you mean by 'read the text'.
If you want the complete page, for instance to use
views on this page in a new PDF document, there should
be no problem.
If you want the content (the sentences, the 'String'),
this is very difficult, if not impossible. iText has
limited functionality to extract Strings, but due to
the nature of PDF and the way PDF parsing is done
in iText, the result probably won't be sufficient for
your needs.

Thanks a lot.
Maurizio Cocchi.
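
One concrete reason the 'String' is hard to get back: a PDF content stream often draws a single word in several pieces, with numeric position adjustments in between (the TJ operator), and a visible space may exist only as a larger adjustment rather than as a space character. The sketch below illustrates this on a hand-written TJ array. The class name TjArraySketch, the method joinTjArray and the WORD_GAP threshold of 200 are assumptions chosen for this example, not values taken from iText or from the PDF specification; escapes and nested parentheses are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TjArraySketch {

  // Hypothetical threshold, in thousandths of a text-space unit.
  // Adjustments more negative than this are treated as a word gap.
  static final double WORD_GAP = 200.0;

  /** Joins the string pieces of a TJ array, guessing where spaces belong. */
  public static String joinTjArray(String array) {
    // Either a simple (...) literal without escapes, or a number.
    Pattern p = Pattern.compile("\\(([^()\\\\]*)\\)|(-?\\d+(?:\\.\\d+)?)");
    Matcher m = p.matcher(array);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      if (m.group(1) != null) {
        // A piece of text: append it unchanged.
        out.append(m.group(1));
      } else if (Double.parseDouble(m.group(2)) < -WORD_GAP) {
        // The number is subtracted from the position, so a large
        // negative value moves the next piece further to the right.
        out.append(' ');
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // "Hello world" split into four pieces; there is no space character.
    String array = "[(Hel) -20 (lo) -300 (w) 15 (orld)] TJ";
    System.out.println(joinTjArray(array)); // prints "Hello world"
  }
}
```

Any such threshold is a guess that depends on the font and font size, which is why heuristics like this one, or the 100/1000 limits in the code posted above, work for some documents and fail for others.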
------------------------------------------------------- SF email is sponsored by - The IT Product Guide Read honest & candid reviews on hundreds of IT Products from real users. Discover which products truly live up to the hype. Start reading now. http://productguide.itmanagersjournal.com/ _______________________________________________ iText-questions mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/itext-questions

