Thanks, sounds interesting. There's definitely a need for that. Just create an issue in JIRA with your text and your patch.
https://issues.apache.org/jira/browse/PDFBOX

About your patch:
Please remove any changes that are just reformatting. That makes more work for us, because it shows more changes than there really are. I try to understand everything, not just test that it works. Example:

-                int r = clamp( (1.164f * (Y-16)) + (1.596f * (Cr - 128)) );
-                int g = clamp( (1.164f * (Y-16)) + (-0.392f * (Cb-128)) + (-0.813f * (Cr-128)));
-                int b = clamp( (1.164f * (Y-16)) + (2.017f * (Cb-128)));
+                int r = clamp((1.164f * (Y - 16)) + (1.596f * (Cr - 128)));
+                int g = clamp((1.164f * (Y - 16)) + (-0.392f * (Cb - 128)) + (-0.813f * (Cr -
+                        128)));
+                int b = clamp((1.164f * (Y - 16)) + (2.017f * (Cb - 128)));

Be aware that if your patch changes the public API, then it won't be used in the 2.0 branch. (Your patch should still be against the trunk.)

Also make sure that your changes in SampledImageReader don't make the "normal" path (i.e. reading the entire stream and converting it to an image) slower. The current code is the result of several optimizations.

Public API (e.g. DecodeOptions) should have some javadoc. I have no idea what "honored" does.

About decoding with METADATA_ONLY: does it mean nothing is decoded if there is a scratch file?

Tilman


On 01.03.2018 at 12:54, Itai wrote:
Hello,

Following a question asked on pdfbox-users [1], I set about trying to allow rendering images at lower resolutions, and additionally rendering only parts of images. The need arises from having very large images, usually JPEG or JBIG2, which are tens of megabytes in size when compressed, but may take up 8 gigabytes or more when rendered as a BufferedImage at full resolution. I have come up with a solution that seems to work (it passes all of the built-in PDFBox tests, and a few manual ones I tried), but since it includes some deep changes in the logic, I understand if it won't find its way into PDFBox.

While working on it, I also came across PDFBOX-3340 [2], and since my hack relies on making changes to the way filters work, it includes a (partial) fix for that bug too.

Finally, since I'm not well versed in Git/GitHub, I'm not sure of the best way to share my work. I attach a unified diff here, but let me know if there is another preferred method (pull request? clone the repository?).

Following is an explanation/description of my changes, for those interested. I would love to hear any feedback, especially for things which may increase the likelihood of such a feature being included in future versions of PDFBox.

Thanks,
Itai.

--

As stated, the issue pertains mainly to very large images (lots of pixels) which are highly compressed. Since DCTFilter, JBIG2Filter etc. render the entire image, I had to augment the way Filter works, to allow it to accept options. This is where the class DecodeOptions comes in. It has sub-region and subsampling options (mirroring those of ImageReadParam), as well as a "metadata only" param. When decoding, you may pass DecodeOptions, such that image-related filters can downscale or only render a part of the image. The "metadata-only" option is used for the `repair` method of PDImageXObject, as it only really needs the DecodeResult - where applicable and possible, a filter encountering this option will not decode the stream, only set the DecodeResult parameters (this is not always possible, e.g. for JPXFilter, which must decode the image to get the parameters).
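
For readers unfamiliar with ImageReadParam, here is a minimal sketch of the sub-region and subsampling mechanism that these options mirror. It uses only the standard javax.imageio API; the class name RegionDecodeSketch and the method decode are made up for illustration and are not part of PDFBox or of the patch.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReadParam;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper, for illustration only (not PDFBox code).
final class RegionDecodeSketch
{
    /**
     * Decodes only part of an encoded image, at reduced resolution.
     *
     * @param encoded   the compressed image bytes (e.g. JPEG or PNG)
     * @param region    the source region in full-resolution pixels, or null for the whole image
     * @param subsample keep every n-th pixel in both directions (1 = full resolution)
     */
    static BufferedImage decode(byte[] encoded, Rectangle region, int subsample) throws IOException
    {
        try (ImageInputStream iis = ImageIO.createImageInputStream(new ByteArrayInputStream(encoded)))
        {
            if (iis == null)
            {
                throw new IOException("Could not create an ImageInputStream");
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext())
            {
                throw new IOException("No ImageIO reader found for this data");
            }
            ImageReader reader = readers.next();
            try
            {
                reader.setInput(iis);
                ImageReadParam param = reader.getDefaultReadParam();
                if (region != null)
                {
                    // Restrict decoding to the requested rectangle.
                    param.setSourceRegion(region);
                }
                // Same period in x and y, no offset into the first period.
                param.setSourceSubsampling(subsample, subsample, 0, 0);
                return reader.read(0, param);
            }
            finally
            {
                reader.dispose();
            }
        }
    }
}
```

With a 100x80 source, a region of 40x20 pixels and subsample 2 should yield a 20x10 image. How much memory and time this actually saves depends on the ImageIO plugin used for the format.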

The DecodeOptions also has an "honored" flag, which the filter sets to true if it honored the options - this is needed because when decoding an image stored in a Flate or LZW stream, the filter doesn't know the image format (or does it? I couldn't find a simple way of telling), so it can't make sense of subsampling or partial render options. SampledImageReader checks this flag, and if it is not set to true it does the subsampling by itself.
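
To make the "honored" handshake concrete, here is a rough sketch of what such an options object could look like. This is not the code from the attached diff: the class name DecodeOptionsSketch, its fields and the helper needsFallback are all assumptions made for illustration.

```java
import java.awt.Rectangle;

/**
 * Hypothetical sketch of the options object described above (not the actual patch).
 * A caller fills in what it wants; a filter sets "honored" if it applied the options.
 */
final class DecodeOptionsSketch
{
    private final Rectangle sourceRegion; // null means "the whole image"
    private final int subsampling;        // 1 means "every pixel"
    private final boolean metadataOnly;   // only fill in the decode result, if possible
    private boolean honored;              // set by the filter, read by the caller

    DecodeOptionsSketch(Rectangle sourceRegion, int subsampling, boolean metadataOnly)
    {
        if (subsampling < 1)
        {
            throw new IllegalArgumentException("subsampling must be >= 1");
        }
        // Defensive copy, because Rectangle is mutable.
        this.sourceRegion = sourceRegion == null ? null : new Rectangle(sourceRegion);
        this.subsampling = subsampling;
        this.metadataOnly = metadataOnly;
    }

    Rectangle getSourceRegion()
    {
        return sourceRegion == null ? null : new Rectangle(sourceRegion);
    }

    int getSubsampling()
    {
        return subsampling;
    }

    boolean isMetadataOnly()
    {
        return metadataOnly;
    }

    /** Returns true if the filter reported that it applied region and subsampling itself. */
    boolean isHonored()
    {
        return honored;
    }

    /** Called by a filter that understands the image format and applied the options. */
    void setHonored(boolean honored)
    {
        this.honored = honored;
    }

    /** Returns true if the caller still has to crop or subsample the decoded data itself. */
    boolean needsFallback()
    {
        return !honored && (sourceRegion != null || subsampling > 1);
    }
}
```

A format-agnostic filter such as Flate would simply never call setHonored(true), so needsFallback() stays true whenever a region or a subsampling factor was requested.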

This allows the addition of a method in PDImage

     BufferedImage getImage(Rectangle region, int subsample) throws IOException;

Its result is not cached, as it is not "canonical".
When drawing an image, PDPageDrawer calculates a subsampling factor based on the desired size:

    int subsample = (int)Math.floor(pdImage.getWidth()/at.getScaleX());
    if (subsample<1) subsample = 1;
    if (subsample>8) subsample = 8;
    drawBufferedImage(pdImage.getImage(null, subsample), at);

So if, for example, the image is to be drawn at 0.5 times its pixel size, it is subsampled at 2-pixel intervals.
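
The same calculation as a standalone sketch with a few worked values. The names SubsampleSketch and subsampleFor are hypothetical. Taking the absolute value of the drawn width (for flipped images) and treating a zero width as the maximum factor are assumptions added here; they are not in the snippet above.

```java
// Hypothetical helper, for illustration only (not PDFBox code).
final class SubsampleSketch
{
    static final int MAX_SUBSAMPLE = 8;

    /**
     * @param imageWidth the image width in source pixels
     * @param drawnWidth the width the image covers on the output device, in device pixels
     * @return a subsampling factor clamped to the range 1..MAX_SUBSAMPLE
     */
    static int subsampleFor(int imageWidth, double drawnWidth)
    {
        // Assumption: a negative scale only means the image is mirrored.
        double size = Math.abs(drawnWidth);
        if (size == 0)
        {
            // Assumption: a degenerate transform gets the coarsest factor.
            return MAX_SUBSAMPLE;
        }
        int factor = (int) Math.floor(imageWidth / size);
        return Math.max(1, Math.min(MAX_SUBSAMPLE, factor));
    }
}
```

For a 4000-pixel-wide image, a drawn width of 2000 gives a factor of 2, a drawn width of 100 is clamped to 8, and a drawn width of 8000 (upscaling) is clamped to 1.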

SampledImageReader issues the corresponding DecodeOptions to PDImage#createInputStream when rendering, and if the "honored" flag is not set, it does its own sub-sampling and partial rendering.
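
As an illustration of that fallback, here is a simplified sketch of cropping and subsampling after the fact. It works on an already decoded BufferedImage for brevity, which is an assumption of this example and not how SampledImageReader works (it reads samples from the stream); FallbackSubsampleSketch and cropAndSubsample are hypothetical names.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper, for illustration only (not PDFBox code).
final class FallbackSubsampleSketch
{
    /**
     * Copies every n-th pixel of the requested region into a new, smaller image.
     *
     * @param full      the fully decoded image
     * @param region    the region to keep, or null for the whole image
     * @param subsample the sampling period in both directions (>= 1)
     */
    static BufferedImage cropAndSubsample(BufferedImage full, Rectangle region, int subsample)
    {
        if (subsample < 1)
        {
            throw new IllegalArgumentException("subsample must be >= 1");
        }
        Rectangle src = new Rectangle(0, 0, full.getWidth(), full.getHeight());
        if (region != null)
        {
            // Clip the request to the image bounds.
            src = src.intersection(region);
        }
        if (src.isEmpty())
        {
            throw new IllegalArgumentException("region does not overlap the image");
        }
        // Round up so that a partial last period still contributes one pixel.
        int w = (src.width + subsample - 1) / subsample;
        int h = (src.height + subsample - 1) / subsample;
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                out.setRGB(x, y, full.getRGB(src.x + x * subsample, src.y + y * subsample));
            }
        }
        return out;
    }
}
```

Note that this variant only shrinks the result; the memory for the full decode has already been spent, which is why having the filter honor the options is preferable where possible.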

I realize most/all of those optimizations won't work for raw, Flate or LZW encoded images, but presumably those won't be too large in the first place. Also, this has little to no benefit for PDInlineImage, but as it already holds all of its raw data I assume little optimization is possible.

In general, this hack allowed me to speed up rendering of some files by significant margins (20%-80%, depending on size and desired DPI), and to significantly lower the memory footprint when only a lower-resolution render, or a render of small regions of the image, is required.

--

[1]: https://lists.apache.org/thread.html/6b396e3d8bfc4ed44bcadf37881035d7447fb711253ef962f187455c@%3Cusers.pdfbox.apache.org%3E
[2]: https://issues.apache.org/jira/browse/PDFBOX-3340

