Thank you for the reply.

I have opened an issue: https://issues.apache.org/jira/browse/PDFBOX-4137

I have attached a revised patch to it. I found some bugs and
inconsistencies in the (erroneous) way I was calculating the
target image size.
I also added documentation and javadoc to the methods and classes I've added.

I tried to revert all of the formatting changes, but for some reason
IntelliJ keeps bringing them in as it creates the patch (even though I
asked it not to reformat the code).
After some more struggles I think I have managed to "sanitize" the patch
(see latest attachment to the issue).

Theoretically, the changes shouldn't slow down the normal path, as it's
supposed to be logically the same.
In practice I guess there may be some minor losses due to differences
between e.g. "++x" and "x+=1".
How would I go about testing it, other than collecting many PDFs with lots
of images, and timing the calls to getImage?
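
One rough way to compare the old and new code paths is a small timing harness like the sketch below. It is only an illustration: TimingSketch, medianMillis and the sumTo workload are hypothetical names, and in a real comparison the Runnable would wrap a call such as getImage on images taken from test PDFs. A dedicated benchmarking tool such as JMH would likely give more trustworthy numbers than this.

```java
import java.util.Arrays;

class TimingSketch {
    /**
     * Runs the task `warmup` times untimed (to let the JIT settle), then
     * returns the median wall-clock time of `runs` timed executions, in ms.
     */
    static double medianMillis(Runnable task, int warmup, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2] / 1_000_000.0;
    }

    /** Stand-in workload; a real test would decode an image here instead. */
    static long sumTo(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        double ms = medianMillis(() -> sumTo(10_000_000), 5, 21);
        System.out.println("median: " + ms + " ms");
    }
}
```

Taking the median rather than the mean makes the result less sensitive to occasional GC pauses; running the same harness against builds with and without the patch would give a first impression of any slowdown.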

I'm not sure I understand your question about the scratch file - the
METADATA_ONLY option is currently only passed in the constructor of
PDImageXObject, where the decoding was only done for the benefit of the
"repair" method...

Itai.


On Thu, Mar 1, 2018 at 7:12 PM, Tilman Hausherr <thaush...@t-online.de>
wrote:

> Thanks, sounds interesting. There's definitely a need for that. Just
> create an issue in JIRA with your text and your patch.
> https://issues.apache.org/jira/browse/PDFBOX
>
> About your patch:
> Please remove any changes that are just reformatting. That makes more work
> for us, because it shows more changes than there really are. I try to
> understand everything, not just test that it works. Example:
>
> -                int r = clamp( (1.164f * (Y-16)) + (1.596f * (Cr - 128)) );
> -                int g = clamp( (1.164f * (Y-16)) + (-0.392f * (Cb-128)) + (-0.813f * (Cr-128)));
> -                int b = clamp( (1.164f * (Y-16)) + (2.017f * (Cb-128)));
> +                int r = clamp((1.164f * (Y - 16)) + (1.596f * (Cr - 128)));
> +                int g = clamp((1.164f * (Y - 16)) + (-0.392f * (Cb - 128)) + (-0.813f * (Cr -
> +                        128)));
> +                int b = clamp((1.164f * (Y - 16)) + (2.017f * (Cb - 128)));
>
> Be aware that if your patch changes the public API, then it won't be used
> in the 2.0 branch. (Your patch should still be against the trunk).
>
> Also make sure that your changes in SampledImageReader don't make the
> "normal" path (i.e. reading the entire stream and converting it to an
> image) slower. The current code is the result of several optimizations.
>
> Public API (e.g. DecodeOptions) should have some javadoc. I have no idea
> what "honored" does.
>
> The decode with METADATA_ONLY - does it mean nothing is decoded if there
> is a scratch file???
>
> Tilman
>
>
>
> Am 01.03.2018 um 12:54 schrieb Itai:
>
>> Hello,
>>
>> Following a question asked on pdfbox-users [1] , I set about trying to
>> allow rendering images at lower resolutions, and additionally rendering
>> only parts of images.  The need arises from having very large images,
>> usually JPEG or JBIG2, which are tens of megabytes in size when compressed,
>> but may take up 8 or even more gigabytes when rendered as a BufferedImage
>> at full resolution.
>> I have come up with a solution that seems to work (passes all of the
>> built-in PDFBox tests, and a few manual ones I tried), but since it
>> includes some deep changes in the logic I understand if it won't find its
>> way into PDFBox.
>>
>> While working on it, I also came across PDFBOX-3340 [2], and since my
>> hack relies on making changes to the way filters work, it includes a
>> (partial) fix for that bug too.
>>
>> Finally, since I'm not well versed in git/github, I'm not sure of the
>> best way to share my work. I attach here a unified diff, but let me know if
>> there is another preferred method (pull request? clone the repository?)
>>
>> Following is an explanation/description of my changes, for those
>> interested. I would love to hear any feedback, especially for things which
>> may increase the likelihood of such a feature being included in future
>> versions of PDFBox.
>>
>> Thanks,
>> Itai.
>>
>> --
>>
>> As stated, the issue pertains mainly to very large images (lots of
>> pixels) which are highly compressed. Since DCTFilter, JBIG2Filter etc.
>> render the entire image, I had to augment the way Filter works, to allow it
>> to accept options.
>> This is where the class DecodeOptions comes in. It has sub-region and
>> subsampling options (mirroring those of ImageReadParam), as well as a
>> "metadata only" param. When decoding, you may pass DecodeOptions, such that
>> image-related filters can downscale or only render a part of the image.
>> The "metadata-only" option is used for the `repair` method of
>> PDImageXObject, as it only really needs the DecodeResult - where applicable
>> and possible, a filter encountering this option will not decode the stream,
>> only set the DecodeResult parameters (this is not always possible, e.g. for
>> JPXFilter, which must decode the image to get the parameters).
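
For reference, the ImageReadParam options that DecodeOptions is said to mirror behave roughly as in the sketch below. It uses only the standard javax.imageio API on an in-memory PNG; the names ReadParamSketch, readRegion and samplePng are made up for this illustration and are not part of the patch.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReadParam;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

class ReadParamSketch {
    /** Decodes only `region` (or the whole image if null), keeping every n-th pixel. */
    static BufferedImage readRegion(byte[] encoded, Rectangle region, int subsample)
            throws IOException {
        try (ImageInputStream in =
                new MemoryCacheImageInputStream(new ByteArrayInputStream(encoded))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("no ImageReader found for this data");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                ImageReadParam param = reader.getDefaultReadParam();
                if (region != null) {
                    param.setSourceRegion(region);
                }
                // Same period in both directions, no offset into the source grid.
                param.setSourceSubsampling(subsample, subsample, 0, 0);
                return reader.read(0, param);
            } finally {
                reader.dispose();
            }
        }
    }

    /** Builds a blank PNG of the given size, purely as test input. */
    static byte[] samplePng(int width, int height) throws IOException {
        ImageIO.setUseCache(false); // keep everything in memory for this sketch
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);
        return out.toByteArray();
    }
}
```

Whether a reader actually skips decoding work for the discarded pixels depends on the format and the reader implementation; the sketch only shows how the region and subsampling are requested.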
>>
>> The DecodeOptions also has an "honored" flag, which the filter sets to
>> true if it honored the options - this is needed because when decoding an
>> image stored in a Flate or LZW stream, the filter doesn't know the image
>> format (or does it? I couldn't find a simple way of telling), so it can't
>> make sense of subsampling or partial render options. SampledImageReader
>> checks this flag, and if it is not set to true it does the subsampling by
>> itself.
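
The "honored" handshake described above could be sketched as follows. This is a simplified stand-in and not the code from the patch: the Options class, the two decoders and the int[][] pixel grid are all hypothetical, chosen only to show how a reader can fall back to its own subsampling when the decoder did not apply the options.

```java
class HonoredFlagSketch {
    /** Hypothetical stand-in for the options object; not the class from the patch. */
    static class Options {
        final int subsampling;
        boolean honored; // set to true by a decoder that applied the options itself

        Options(int subsampling) {
            this.subsampling = subsampling;
        }
    }

    /** A decoder that ignores the options, as a Flate or LZW filter would. */
    static int[][] decodeRaw(int[][] pixels, Options options) {
        return pixels; // options.honored stays false
    }

    /** A decoder that applies the subsampling itself and reports that it did. */
    static int[][] decodeAware(int[][] pixels, Options options) {
        options.honored = true;
        return subsample(pixels, options.subsampling);
    }

    /** Keeps every n-th row and column, starting at index 0. */
    static int[][] subsample(int[][] pixels, int n) {
        int h = (pixels.length + n - 1) / n;
        int w = pixels.length == 0 ? 0 : (pixels[0].length + n - 1) / n;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out[y][x] = pixels[y * n][x * n];
            }
        }
        return out;
    }

    /** Reader side: subsample here only if the decoder did not already do so. */
    static int[][] read(int[][] pixels, Options options, boolean awareDecoder) {
        int[][] decoded = awareDecoder
                ? decodeAware(pixels, options)
                : decodeRaw(pixels, options);
        return options.honored ? decoded : subsample(decoded, options.subsampling);
    }
}
```

Either way the caller ends up with the same subsampled grid; the difference is only whether the full-size data was ever materialized.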
>>
>> This allows the addition of a method in PDImage
>>
>>      BufferedImage getImage(Rectangle region, int subsample) throws IOException;
>>
>> The result of which is not cached, as it is not "canonical".
>> When drawing an image, PDPageDrawer calculates a subsampling factor based
>> on the desired size:
>>
>>     int subsample = (int)Math.floor(pdImage.getWidth()/at.getScaleX());
>>     if (subsample<1) subsample = 1;
>>     if (subsample>8) subsample = 8;
>>     drawBufferedImage(pdImage.getImage(null, subsample), at);
>>
>> Such that if e.g. the image should be drawn at 0.5 times its pixel size,
>> it will be subsampled at 2-pixel intervals.
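
As a worked example of the clamping above, here is the same calculation as a standalone function. It assumes that the X scale of the transform is the width, in device pixels, at which the image is drawn; SubsampleFactorSketch and subsampleFor are illustrative names only.

```java
class SubsampleFactorSketch {
    /**
     * Returns how many source pixels map onto one drawn pixel,
     * rounded down and clamped to the range 1..8.
     */
    static int subsampleFor(int imageWidth, double scaleX) {
        int subsample = (int) Math.floor(imageWidth / scaleX);
        return Math.max(1, Math.min(8, subsample));
    }

    public static void main(String[] args) {
        System.out.println(subsampleFor(1000, 500.0));  // drawn at half size
        System.out.println(subsampleFor(1000, 2000.0)); // drawn enlarged
        System.out.println(subsampleFor(10000, 100.0)); // drawn at 1/100 size
    }
}
```

A 1000-pixel-wide image drawn 500 pixels wide gives a factor of 2, an enlarged image never goes below 1, and very strong reductions are capped at 8.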
>>
>> SampledImageReader issues the corresponding DecodeOptions to
>> PDImage#createInputStream when rendering, and if the "honored" flag is not
>> set, it does its own sub-sampling and partial rendering.
>>
>> I realize most/all of those optimizations won't work for raw, Flate or
>> LZW encoded images, but presumably those won't be too large in the first
>> place. Also, this has little to no benefit for PDInlineImage, but as it
>> already holds all of its raw data I assume little optimization is possible.
>>
>> In general, this hack allowed me to speed-up rendering of some files by
>> significant margins (20%-80%, depending on size and desired DPI), and
>> significantly lower the memory footprint if only a lower-res render is
>> required, or rendering of small regions of the image.
>>
>> --
>>
>> [1]: https://lists.apache.org/thread.html/6b396e3d8bfc4ed44bcadf37881035d7447fb711253ef962f187455c@%3Cusers.pdfbox.apache.org%3E
>> [2]: https://issues.apache.org/jira/browse/PDFBOX-3340
>>
>>
