I'm trying to extract the characters from page 41 of:
www.irs.gov/pub/irs-pdf/i1040.pdf
However, using the attached ExtractPageContentSorted.java and its
member function at_page, where reader was produced from:
www.irs.gov/pub/irs-pdf/i1040.pdf
and pageNum was 41,
I only managed to produce the output shown in the second attachment,
i1040p41.txt. The characters shown in i1040p41.txt are nothing like
what appears on page 41 of i1040.pdf. Since the at_page member
function essentially does what Listing 15.27 in the book does:
http://itextpdf.com/examples/iia.php?id=296
I had expected the characters to come out OK.
I also tried another text extractor, poppler:
http://poppler.freedesktop.org/
which produced similar garbage characters.
What can be done to *properly* extract the text characters from page
41 of i1040.pdf?
TIA.
-regards,
Larry
package lje;

import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import lje.OpPage;

public class ExtractPageContentSorted
    implements OpPage
{
    // No setup is needed before the pages are visited.
    public void pre_pages(PrintWriter out)
    {
    }

    // Write the text that iText extracts from page pageNum of reader to out.
    public void at_page(PdfReader reader, int pageNum, PrintWriter out)
        throws IOException
    {
        out.println(PdfTextExtractor.getTextFromPage(reader, pageNum));
    }
}
***Page:41
@ &
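
To flag output like the above without eyeballing every page, here is a minimal sketch of a sanity check that could be run on whatever string getTextFromPage returns. It is only a heuristic and does not fix the extraction itself. The class name ExtractedTextCheck, the methods alnumRatio and looksGarbled, and the 0.5 threshold are all hypothetical names and values chosen for illustration; they are not part of iText or of the attached code.

```java
public class ExtractedTextCheck {

    // Fraction of the non-whitespace characters in text that are letters or
    // digits. Returns 1.0 when there are no visible characters at all, so a
    // blank page is not reported as garbled.
    static double alnumRatio(String text) {
        int visible = 0;
        int alnum = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            visible++;
            if (Character.isLetterOrDigit(cp)) {
                alnum++;
            }
        }
        return visible == 0 ? 1.0 : (double) alnum / visible;
    }

    // Assumption: ordinary English page text is mostly letters and digits,
    // so a ratio below 0.5 (an arbitrary, hypothetical cutoff) is suspicious.
    static boolean looksGarbled(String text) {
        return alnumRatio(text) < 0.5;
    }

    public static void main(String[] args) {
        // The extracted line shown in i1040p41.txt above.
        System.out.println(looksGarbled("@ &"));
        // A made-up sample of readable text, for comparison.
        System.out.println(looksGarbled("Line 44 Tax Table"));
    }
}
```

In at_page, the string returned by PdfTextExtractor.getTextFromPage(reader, pageNum) could be passed to looksGarbled before printing, so that pages with this problem are reported by number instead of being discovered by reading the output.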