Larry Evans wrote
> I'm trying to extract the characters from page 41 of:
>      www.irs.gov/pub/irs-pdf/i1040.pdf
> However, using the attached, ExtractPageContentSorted.java, and the member
> function, at_page, where:
>    reader was produced from:
>      www.irs.gov/pub/irs-pdf/i1040.pdf
>    and pageNum was:
>      41
> I only managed to produce the output shown in 2nd attachment,
> i1040p41.txt.  The characters shown in i1040p41.txt are nothing like what
> appears on page 41 of i1040.pdf.  Since the at_page member function
> essentially does what Listing 15.27 in the book does:
> 
> What can be done to *properly* extract the text characters from page 41 of
> i1040.pdf?

When I apply

>     final static File TEST_FILE_IRS1040 = new File("data/i1040.pdf");
>     final static File TEST_FILE_IRS1040_TEXT = new File("data/out/i1040_41.txt");
>
>     public void testExtractIrs1040() throws DocumentException, IOException
>     {
>         System.out.printf("\n\nFile %s:", TEST_FILE_IRS1040);
>         PdfReader reader = new PdfReader(TEST_FILE_IRS1040.toString());
>         String text = PdfTextExtractor.getTextFromPage(reader, 41);
>
>         System.out.println();
>         System.out.printf(">>>%s<<<\n", text);
>
>         FileOutputStream fos = new FileOutputStream(TEST_FILE_IRS1040_TEXT);
>         fos.write(text.getBytes());
>         fos.close();
>     }

I get

> Page 41 of 108 Fileid: … ions/I1040/2012/A/XML/Cycle10/source 21:06 -
> 18-Jan-2013
> The type and rule above prints on all proofs including departmental
> reproduction proofs. MUST be removed before printing.
> 2012 Form 1040—Line 44
> Qualified Dividends and Capital Gain Tax Worksheet—Line 44 Keep for Your
> Records
> Before you begin:
> See the earlier instructions for line 44 to see if you can use this
> worksheet to figure your tax.
> Before completing this worksheet, complete Form 1040 through line 43.
> If you do not have to file Schedule D and you received capital gain
> distributions, be sure you checked 
> the box on line 13 of Form 1040.
> 1. Enter the amount from Form 1040, line 43. However, if you are filing
> Form 
> 2555 or 2555-EZ (relating to foreign earned income), enter the amount from 
>  
> line 3 of the Foreign Earned Income Tax Worksheet .....................1.
> 2. Enter the amount from Form 1040, line 9b* ....... 
> 2.
> 3. Are you filing Schedule D?*
> Yes. Enter the smaller of line 15 or 16 of 
> Schedule D. If either line 15 or line 16 is 
>  
> blank or a loss, enter -0-  3.
> No. Enter the amount from Form 1040, line 13
> 4. Add lines 2 and 3 .............................. 
> 4.
> 5. If filing Form 4952 (used to figure investment 
> interest expense deduction), enter any amount from 
>  
> line 4g of that form. Otherwise, enter -0- .......... 5.
> 6. Subtract line 5 from line 4. If zero or less, enter -0-
> ...................... 
> 6.
> 7. Subtract line 6 from line 1. If zero or less, enter -0-
> ...................... 
> 7.
> 8. Enter: 
> $35,350 if single or married filing separately, 
> $70,700 if married filing jointly or qualifying widow(er),
>  
> ............. 8.
> $47,350 if head of household. 
> 9. Enter the smaller of line 1 or line 8
> .................................... 
> 9.
> 10. Enter the smaller of line 7 or line 9
> .................................... 
> 10.
> 11. Subtract line 10 from line 9. This amount is taxed at 0%
> .................. 
> 11.
> 12. Enter the smaller of line 1 or line 6
> .................................... 
> 12.
> 13. Enter the amount from line 11 ........................................ 
> 13.
> 14. Subtract line 13 from line 12
> ......................................... 
> 14.
> 15. Multiply line 14 by 15% (.15)
> ......................................................... 
> 15.
> 16. Figure the tax on the amount on line 7. If the amount on line 7 is
> less than $100,000, use the Tax 
> Table to figure this tax. If the amount on line 7 is $100,000 or more, use
> the Tax Computation 
>  
> Worksheet
> .........................................................................
> 16.
> 17. Add lines 15 and 16
> .................................................................. 
> 17.
> 18. Figure the tax on the amount on line 1. If the amount on line 1 is
> less than $100,000, use the Tax 
> Table to figure this tax. If the amount on line 1 is $100,000 or more, use
> the Tax Computation 
>  
> Worksheet
> .........................................................................
> 18.
> 19. Tax on all taxable income. Enter the smaller of line 17 or line 18.
> Also include this amount on 
> Form 1040, line 44. If you are filing Form 2555 or 2555-EZ, do not enter
> this amount on Form 
>  
> 1040, line 44. Instead, enter it on line 4 of the Foreign Earned Income
> Tax Worksheet ......... 19.
> *If you are filing Form 2555 or 2555-EZ, see the footnote in the Foreign
> Earned Income Tax Worksheet before completing this line.
> -41-
> Need more information or forms? Visit IRS.gov.

which looks OK. Thus, there is probably some other issue in your setup: maybe
your PrintWriter uses a destructive encoding, or maybe some other program down
the pipeline does something weird. Or maybe you are using some old iText version?
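
To illustrate what I mean by a destructive encoding, here is a minimal sketch that does not use iText at all. The class name EncodingRoundTrip and the method roundTrip are made up for this illustration, and the sample string is simply one line of the extracted text above. It writes the text through a PrintWriter with a given charset and decodes the bytes again with the same charset. This only demonstrates the hypothesis; I do not know what your PrintWriter actually does.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of iText).
public class EncodingRoundTrip {

    // Writes text through a PrintWriter using the given charset, then decodes
    // the resulting bytes with the same charset ("write the file, view it").
    static String roundTrip(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(buffer, charset))) {
            out.print(text);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // \u2014 is the long dash that appears in "2012 Form 1040...Line 44".
        String sample = "2012 Form 1040\u2014Line 44";

        // UTF-8 can represent the dash, so the text survives unchanged.
        System.out.println(sample.equals(roundTrip(sample, StandardCharsets.UTF_8)));    // true

        // US-ASCII cannot represent the dash; the writer silently replaces it with '?'.
        System.out.println(sample.equals(roundTrip(sample, StandardCharsets.US_ASCII))); // false
    }
}
```

If your output file is written with a charset that cannot represent some of the extracted characters, they are silently replaced in this way. Passing an explicit charset such as UTF-8 to the writer, and opening the file as UTF-8 in your viewer, would rule this cause out.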

Regards, Michael



