[iText-questions] apparently garbage characters extracted from i1040.pdf page 41.

Larry Evans Sat, 21 Sep 2013 12:36:30 -0700

I'm trying to extract the characters from page 41 of:
    www.irs.gov/pub/irs-pdf/i1040.pdf
However, using the member function at_page in the attached
ExtractPageContentSorted.java, where:
  reader was produced from:
    www.irs.gov/pub/irs-pdf/i1040.pdf
  and pageNum was:
    41
I only managed to produce the output shown in the 2nd attachment,
i1040p41.txt.  The characters shown in i1040p41.txt are nothing like
what appears on page 41 of i1040.pdf.  Since the at_page member
function essentially does what Listing 15.27 in the book does:


  http://itextpdf.com/examples/iia.php?id=296

I had expected the characters to come out OK.

I also tried another text extractor:
  http://poppler.freedesktop.org/
which showed similar garbage characters.

What can be done to *properly* extract the text characters from page
41 of i1040.pdf?

TIA.

-regards,
Larry

package lje;

import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import lje.OpPage;

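/** Prints the text that iText extracts from a single page. */
public class ExtractPageContentSorted
    implements OpPage
{
    // Hook from OpPage (interface not shown); nothing is done here.
    public void pre_pages(PrintWriter out)
    {
    }
    // Extracts the text of page pageNum (1-based) from reader and
    // prints it to out.
    public void at_page(PdfReader reader, int pageNum, PrintWriter out)
        throws IOException
    {
        out.println(PdfTextExtractor.getTextFromPage(reader, pageNum));
    }
}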

***Page:41

@&!$&&&.B;A,< 





 


  
   

 
   


  
 
   

 


  
  
 

  
 

 



 

   

   

   


   

   

   

  
 

 



 
 
  
 

 



 
 
    




  

 
2
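
Since most of the output above looks like blank lines, it may help to
see which code points the extractor actually returned before deciding
what is wrong.  The following is only a minimal sketch under stated
assumptions: CodePointDump, histogram and describe are hypothetical
names that are not part of iText or of the attached code, and the sample
string is just the visible characters of the first output line.  It uses
only the Java standard library (Java 8 or later).

```java
import java.util.Map;
import java.util.TreeMap;

public class CodePointDump
{
    // Hypothetical helper (not an iText API): counts how often each
    // Unicode code point occurs in the given text.
    public static Map<Integer, Integer> histogram(String text)
    {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    // Formats the histogram one code point per line, in ascending
    // code point order, e.g. "U+0041 x2".
    public static String describe(String text)
    {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Integer> e : histogram(text).entrySet()) {
            sb.append(String.format("U+%04X x%d%n", e.getKey(), e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // Assumption: visible characters of the first output line.
        System.out.print(describe("@&!$&&&.B;A,<"));
    }
}
```

In at_page, the string returned by PdfTextExtractor.getTextFromPage
could be passed to describe instead of being printed directly; any
control characters or private-use code points would then show up by
number rather than as apparent whitespace.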


