[iText-questions] [SPAM] Re: remove trailing whitespace and newlines from pdf

mkl Mon, 25 Nov 2013 01:47:38 -0800

Andy,

Andy Newman wrote
> I am attempting to remove whitespace from a pdf which contains vector
> graphics.


Some pointers, not a final solution:

1. As you merely want to trim pages, this is not a use case for PdfWriter +
getImportedPage but for PdfStamper. Your main code might look like this:

    PdfReader reader = new PdfReader(resourceStream);
    PdfStamper stamper = new PdfStamper(reader,
            new FileOutputStream("target/test-outputs/test-trimmed-stamper.pdf"));

    // Go through all pages
    int n = reader.getNumberOfPages();
    for (int i = 1; i <= n; i++)
    {
        Rectangle pageSize = reader.getPageSize(i);
        Rectangle rect = getOutputPageSize(pageSize, reader, i);

        PdfDictionary page = reader.getPageN(i);
        page.put(PdfName.CROPBOX, new PdfArray(new float[]{
                rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
        stamper.markUsed(page);
    }
    stamper.close();

   As you can see, I also added another argument to your getOutputPageSize
method: the page number. After all, the amount of white space to trim may
differ from page to page.

2. If the source document did not contain vector graphics, you could simply
use the iText parser package classes. There is even a TextMarginFinder
already based on them. In this case the getOutputPageSize method could look
like this:

    private Rectangle getOutputPageSize(Rectangle pageSize, PdfReader reader, int page)
            throws IOException
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        TextMarginFinder finder = parser.processContent(page, new TextMarginFinder());
        Rectangle result = new Rectangle(finder.getLlx(), finder.getLly(),
                finder.getUrx(), finder.getUry());
        System.out.printf("Actual boundary: %f,%f to %f, %f\n",
                finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
        return result;
    }

   Using this method with your file test.pdf results in:

<http://itext-general.2136553.n4.nabble.com/file/n4659499/TrimTest.png> 

   As you can see, the page is trimmed according to the text (and bitmap
image) content on the page.

3. To find the bounding box respecting vector graphics, too, you essentially
have to do the same, but you have to extend the parser framework used here so
that it also informs its listeners about vector graphics operations (the
TextMarginFinder essentially is a listener to drawing events sent from the
parser framework). This is non-trivial, especially if you don't know PDF
syntax by heart yet.
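
   The bookkeeping part of such an extension can be sketched in plain Java.
This is only an illustration under assumptions: the class PathBoundsSketch
and its methods are hypothetical and not part of iText, and the hard part
(getting the parser to report path construction operators such as m, l, c
and re, applying the current transformation matrix, and ignoring paths that
are never painted) is left out. For curves, adding the control points gives
a conservative box that may be slightly larger than the curve itself.

```java
// Hypothetical helper, not part of iText: collects the bounding box of the
// points a path-aware listener would receive from an extended parser.
final class PathBoundsSketch {
    private float llx = Float.POSITIVE_INFINITY;
    private float lly = Float.POSITIVE_INFINITY;
    private float urx = Float.NEGATIVE_INFINITY;
    private float ury = Float.NEGATIVE_INFINITY;

    /** Adds one point, assumed to be in page coordinates already. */
    void addPoint(float x, float y) {
        llx = Math.min(llx, x);
        lly = Math.min(lly, y);
        urx = Math.max(urx, x);
        ury = Math.max(ury, y);
    }

    /** Adds an axis-aligned rectangle given as x, y, width, height. */
    void addRectangle(float x, float y, float width, float height) {
        addPoint(x, y);
        addPoint(x + width, y + height);
    }

    /** True as long as no point has been added. */
    boolean isEmpty() {
        return llx > urx;
    }

    /** Returns {llx, lly, urx, ury}; only meaningful if isEmpty() is false. */
    float[] toArray() {
        return new float[] { llx, lly, urx, ury };
    }
}
```

   The resulting values could then be merged with the TextMarginFinder
results by taking the smaller lower-left and the larger upper-right
coordinates.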

4. If the PDFs you want to trim are not too generic and can be relied upon to
contain some text or bitmap graphics, though, you could use the sample code
above anyway. E.g. if your pages always start with text at the top and end
with text at the bottom, you could change getOutputPageSize to create the
result rectangle like this:

        Rectangle result = new Rectangle(pageSize.getLeft(), finder.getLly(),
                pageSize.getRight(), finder.getUry());

   This only trims empty space at the top and bottom, which might suffice
depending on your requirements.
   Or you can use other heuristics based on what you know about the input
data. If you know something about the positioning of text (e.g. that the
heading is always centered and some other text always starts at the left),
you can easily extend the TextMarginFinder to take advantage of this
knowledge.
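
   One such heuristic, sketched in plain Java so that it does not depend on
iText: keep the page's left and right edges, take top and bottom from the
text bounds, leave a little room around the text, and never grow beyond the
original page. The class VerticalTrimSketch and the margin parameter are
assumptions made for illustration; boxes are passed as {llx, lly, urx, ury}
arrays instead of Rectangle objects.

```java
// Hypothetical helper, not part of iText: trims only top and bottom.
final class VerticalTrimSketch {
    /**
     * pageBox and contentBox are {llx, lly, urx, ury} in PDF units.
     * margin is extra space kept below and above the content.
     */
    static float[] trimTopAndBottom(float[] pageBox, float[] contentBox, float margin) {
        // Clamp to the page so the result never exceeds the original size.
        float lly = Math.max(pageBox[1], contentBox[1] - margin);
        float ury = Math.min(pageBox[3], contentBox[3] + margin);
        // Left and right edges are taken from the page unchanged.
        return new float[] { pageBox[0], lly, pageBox[2], ury };
    }
}
```

   In getOutputPageSize the two input arrays would be filled from pageSize
and from the finder values, and the result turned back into a Rectangle.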

Regards,   Michael



