Re: Apache PDFBox Board Report April 2018 due

2018-04-09 Thread Maruan Sahyoun
Hi,

two smaller changes you might want to make:

"we are planning to release 1.8 ... to support users who are still on Java 1.6"
instead of "maybe ..."

Don't mention me individually for the website, it's all a team effort: "we are
improving ..."

+1 regardless 

BR
Maruan 

> On 09.04.2018 at 17:52, Andreas Lehmkuehler wrote:
> 
> Hi,
> 
> find attached a quick draft of the board report we're expected to submit this
> month. It's based upon the report template which can be found at [1]
> 
> 
> Any further comments, objections or additions?
> 
> 
> 
> ## Description:
> - the Apache PDFBox library is an open source Java tool for working with PDF
>   documents.
> 
> ## Issues:
> - there are no issues requiring board attention at this time.
> 
> ## Activity:
> - we released the first Apache based version of the JBig2 ImageIO plugin last
>   month
> - we are working on fixing bugs and adding smaller improvements to 2.0.x
> - we have already resolved quite a number of 2.0.x related tickets so that
>   most likely the next bugfix version 2.0.10 will be released soon
> - maybe we are going to release a bugfix version of 1.8.x as well
> - Maruan works on improving our website
> 
> ## Health report:
> - there is a steady stream of contributions, bug reports and questions on the
>   mailing lists
> - we are pleased to see that the number of patches compared to other
>   contributions is growing. So more and more people are not just using PDFBox
>   but digging deeper into it
> 
> ## PMC changes:
> 
> - Currently 21 PMC members.
> - No new PMC members added in the last 3 months
> - Last PMC addition was Matthäus Mayer on Mon Oct 16 2017
> 
> ## Committer base changes:
> 
> - Currently 21 committers.
> - No new committers added in the last 3 months
> - Last committer addition was Joerg O. Henne at Mon Oct 09 2017
> 
> ## Releases:
> 
> - 2.0.9 was released on Fri Mar 23 2018
> - 3.0.0 JBIG2 was released on Tue Feb 27 2018
> 
> 
> 
> Andreas
> 
> [1] https://reporter.apache.org/?pdfbox
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: dev-h...@pdfbox.apache.org
> 





Jenkins build is back to normal : PDFBox-sonar » Apache PDFBox #436

2018-04-09 Thread Apache Jenkins Server
See 






Jenkins build is back to normal : PDFBox-sonar #436

2018-04-09 Thread Apache Jenkins Server
See 






Re: Apache PDFBox Board Report April 2018 due

2018-04-09 Thread Tilman Hausherr

+1

Tilman




[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430777#comment-16430777
 ] 

Tilman Hausherr commented on PDFBOX-4182:
-

You should open an issue in his project... but I don't know if he can help 
without the files involved.

> Improve memory usage of PDFMergerUtility
> 
>
> Key: PDFBOX-4182
> URL: https://issues.apache.org/jira/browse/PDFBOX-4182
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9
> Reporter: Pas Filip
> Priority: Major
> Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java, 
> Suppliers.java, 
> failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png, 
> merge-pdf-stats.xlsx, oom-2gb-heap-after-refactoring-leak-suspect-1.png, 
> oom-2gb-heap-after-refactoring-leak-suspect-2.png, successful - 
> refactored-merge-utility-4gb-heap-2618-files-merged.png, successful 
> -merge-utility-6gb-heap-2618-files-merged.png, 
> successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png, 
> successful-merge-utility-8gb-heap-2618-files-merged.png, 
> successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png
>
>
> I have been running some tests trying to merge large amounts (2618) of small 
> pdf documents, between 100kb and 130kb, into a single large pdf (288.433kb)
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to consume the majority of the memory usage.
> (see screenshot from mat in attachment)
> (I would include the hprof in attachment so you can analyze yourselves but 
> it's rather large)
> Note that it seems impossible to generate a large pdf using a small memory 
> footprint.
> I personally thought that using MemorySettings with temporary file only would 
> allow me to generate arbitrarily large pdf files but it doesn't seem to help.
> I've run mergeDocuments with the following memory settings:
>  * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L 
> * 1024L)
>  * MemoryUsageSetting.setupTempFileOnly()
> Refactored version completes with *4GB* heap:
> with temp file only completes 2618 documents in 1.760 min
> *VS*
> *8GB* heap:
> with temp file only completes 2618 documents in 2.0 min
> Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB 
> and 8GB)
>  It looks like the loop in the mergeDocuments accumulates PDDocument objects 
> in a list which are closed after the merge is completed.
> Refactoring the code to close these as they are used, instead of accumulating 
> them and closing all at the end, improves memory usage considerably (although 
> the problem doesn't seem to be eliminated completely, based on the mat analysis).
> Another change I've implemented is to only create the inputstream when the 
> file needs to be read and to close it alongside the PDDocument.
> (Some inputstreams contain buffers and depending on the size of the buffers 
> and or the stream type accumulating all the streams is a potential 
> memory-hog.)
> These changes seem to be beneficial in the sense that I can 
> process the same number of pdfs with about half the memory.
>  I'd appreciate it if you could roll these changes into the main codebase.
> (I've respected java 6 compatibility.)
> I've included in attachment the java files of the new implementation:
>  * Suppliers
>  * Supplier
>  * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. No signature 
> changes, only internal code changes. (Just rename the class to 
> PDFMergerUtility if you decide to implement the changes.)
>  In attachment you can also find some screenshots from visualvm showing the 
> memory usage of the original version and the refactored version as well as 
> some info produced by mat after analysing the heap.
> If you know of any other means, without running into memory issues, to merge 
> large sets of pdf files into a large single pdf I'd love to hear about it!
> I'd also suggest that further improvements be made to memory usage in 
> general, as pdfbox seems to consume a lot of memory.
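
The close-as-you-go refactoring described in the issue can be illustrated with a small JDK-only sketch. It does not use PDFBox and it is not the code from the attached PDFMergerUtilityUsingSupplier.java; `StreamSupplier`, `OpenCounter` and all method names are hypothetical stand-ins. It only shows why opening each source on demand and closing it right after use keeps one source open at a time, whereas collecting the sources and closing them at the end keeps all of them open. It avoids lambdas and try-with-resources, in the spirit of the Java 6 compatibility mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain JDK streams stand in for the documents being merged.
final class LazyCloseSketch {

    /** Hypothetical supplier: the stream is created only when open() is called. */
    interface StreamSupplier {
        InputStream open() throws IOException;
    }

    /** Tracks how many streams are open right now and the highest count seen. */
    static final class OpenCounter {
        int open;
        int peak;
    }

    static StreamSupplier source(final byte[] data, final OpenCounter counter) {
        return new StreamSupplier() {
            @Override
            public InputStream open() {
                counter.open++;
                counter.peak = Math.max(counter.peak, counter.open);
                return new ByteArrayInputStream(data) {
                    private boolean closed;

                    @Override
                    public void close() throws IOException {
                        if (!closed) {          // count each stream only once
                            closed = true;
                            counter.open--;
                        }
                        super.close();
                    }
                };
            }
        };
    }

    /** Reads a stream to the end and returns the number of bytes seen. */
    static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
        }
        return total;
    }

    /** Accumulating variant: every stream stays open until the loop has finished. */
    static long closeAllAtEnd(List<StreamSupplier> sources) throws IOException {
        List<InputStream> opened = new ArrayList<InputStream>();
        long total = 0;
        try {
            for (StreamSupplier supplier : sources) {
                InputStream in = supplier.open();
                opened.add(in);
                total += drain(in);
            }
        } finally {
            for (InputStream in : opened) {
                in.close();
            }
        }
        return total;
    }

    /** Refactored variant: each stream is opened on demand and closed after use. */
    static long closeAsYouGo(List<StreamSupplier> sources) throws IOException {
        long total = 0;
        for (StreamSupplier supplier : sources) {
            InputStream in = supplier.open();
            try {
                total += drain(in);
            } finally {
                in.close();
            }
        }
        return total;
    }

    /** Runs one variant over `count` dummy sources and returns the peak open count. */
    static int peakOpen(boolean closeEarly, int count) {
        OpenCounter counter = new OpenCounter();
        List<StreamSupplier> sources = new ArrayList<StreamSupplier>();
        for (int i = 0; i < count; i++) {
            sources.add(source(new byte[1024], counter));
        }
        try {
            if (closeEarly) {
                closeAsYouGo(sources);
            } else {
                closeAllAtEnd(sources);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return counter.peak;
    }

    public static void main(String[] args) {
        System.out.println("close all at end, peak open: " + peakOpen(false, 5));
        System.out.println("close as you go,  peak open: " + peakOpen(true, 5));
    }
}
```

In this toy setup the peak number of open sources grows with the input count in the accumulating variant and stays at one in the refactored variant. How much heap that saves in a real merge depends on what each open document holds on to, which the sketch does not model.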



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




Re: PDFBox 1.8.14 release?

2018-04-09 Thread Tilman Hausherr

+1



Tilman

On 09.04.2018 at 17:55, Andreas Lehmkuehler wrote:

Hi,

1.8.13 was released more than a year ago and there are more than 20 
resolved JIRA tickets. How about releasing a new 1.8.x version?


Andreas




Re: PDFBox 1.8.14 release?

2018-04-09 Thread Maruan Sahyoun
Hi,

+1

Maruan


> On 09.04.2018 at 17:55, Andreas Lehmkuehler wrote:
> 
> Hi,
> 
> 1.8.13 was released more than a year ago and there are more than 20 resolved 
> JIRA tickets. How about releasing a new 1.8.x version?
> 
> Andreas
> 



[jira] [Comment Edited] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Pas Filip (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430725#comment-16430725
 ] 

Pas Filip edited comment on PDFBOX-4182 at 4/9/18 3:56 PM:
---

On a completely different note

I've been running some tests based on the sambox console command-line feature to 
merge multiple pdfs.

It seems to run faster for small loads, but it fails to complete with 10.000+ docs 
(not a memory issue: 3gb out of 8gb are still free).

I seem to run into a deadlock here: 
org.sejda.io.FileChannelSeekableSource.position(FileChannelSeekableSource.java:59)

 
||Number of pdf files||Generated pdf file size||Duration||
|1000|83.469kb|01m45|
|3000|245.073kb|13m15|

 

It might be able to merge large files, but it takes about 6 times as long (for 3000 
files).
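
The "about 6 times" figure can be cross-checked with a little arithmetic. The sketch below makes two assumptions that the comment itself does not state: that the baseline is the refactored PDFBox run from the issue description (2618 documents in 1.760 min, read as decimal minutes), and that "01m45" and "13m15" mean minutes and seconds. The class and method names are made up for this illustration.

```java
// Back-of-the-envelope check of the timings quoted in this thread.
final class MergeTimingCheck {

    /** Average wall-clock seconds spent per merged file. */
    static double secondsPerFile(double totalSeconds, int files) {
        return totalSeconds / files;
    }

    /** Per-file cost of the 3000-file sambox run relative to the PDFBox run. */
    static double slowdown() {
        double sambox = secondsPerFile(13 * 60 + 15, 3000); // 13m15 for 3000 files
        double pdfbox = secondsPerFile(1.760 * 60, 2618);   // assumed baseline: 1.760 min
        return sambox / pdfbox;
    }

    public static void main(String[] args) {
        System.out.println("sambox 1000 files, s/file: " + secondsPerFile(1 * 60 + 45, 1000));
        System.out.println("sambox 3000 files, s/file: " + secondsPerFile(13 * 60 + 15, 3000));
        System.out.println("pdfbox 2618 files, s/file: " + secondsPerFile(1.760 * 60, 2618));
        System.out.println("slowdown factor:           " + slowdown());
    }
}
```

Under these assumptions the per-file cost is roughly 0.105 s and 0.265 s for the two sambox runs and roughly 0.040 s for the PDFBox run, a factor of about 6.6, which is in line with the estimate in the comment. The sambox per-file cost also more than doubles between 1000 and 3000 files, so the slowdown is not constant per document.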

 



[jira] [Comment Edited] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Pas Filip (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430725#comment-16430725
 ] 

Pas Filip edited comment on PDFBOX-4182 at 4/9/18 3:55 PM:
---

On a completely different note

I've been running some tests based on the sambox console command-line feature to 
merge multiple pdfs.

It seems to run faster for small loads, but it fails to complete with 10.000+ docs 
(not a memory issue: 3gb out of 8gb are still free).

I seem to run into a deadlock here: 
org.sejda.io.FileChannelSeekableSource.position(FileChannelSeekableSource.java:59)

 
||Number of pdf files||Generated pdf file size||Duration||
|1000|83.469kb|01m45|
|3000|245.073kb|13m15|

 



PDFBox 1.8.14 release?

2018-04-09 Thread Andreas Lehmkuehler

Hi,

1.8.13 was released more than a year ago and there are more than 20 resolved JIRA 
tickets. How about releasing a new 1.8.x version?


Andreas




Apache PDFBox Board Report April 2018 due

2018-04-09 Thread Andreas Lehmkuehler

Hi,

find attached a quick draft of the board report we're expected to submit this
month. It's based upon the report template which can be found at [1]


Any further comments, objections or additions?



## Description:
 - the Apache PDFBox library is an open source Java tool for working with PDF
   documents.

## Issues:
 - there are no issues requiring board attention at this time.

## Activity:
 - we released the first Apache based version of the JBig2 ImageIO plugin last
   month
 - we are working on fixing bugs and adding smaller improvements to 2.0.x
 - we have already resolved quite a number of 2.0.x related tickets so that
   most likely the next bugfix version 2.0.10 will be released soon
 - maybe we are going to release a bugfix version of 1.8.x as well
 - Maruan works on improving our website

## Health report:
 - there is a steady stream of contributions, bug reports and questions on the
   mailing lists
 - we are pleased to see that the number of patches compared to other
   contributions is growing. So more and more people are not just using PDFBox
   but digging deeper into it


## PMC changes:

 - Currently 21 PMC members.
 - No new PMC members added in the last 3 months
 - Last PMC addition was Matthäus Mayer on Mon Oct 16 2017

## Committer base changes:

 - Currently 21 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Joerg O. Henne at Mon Oct 09 2017

## Releases:

 - 2.0.9 was released on Fri Mar 23 2018
 - 3.0.0 JBIG2 was released on Tue Feb 27 2018



Andreas

[1] https://reporter.apache.org/?pdfbox




[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Pas Filip (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430725#comment-16430725
 ] 

Pas Filip commented on PDFBOX-4182:
---

On a completely different note

I've been running some tests based on the sambox console command-line feature to 
merge multiple pdfs.

It seems to run faster for small loads, but it fails to complete with 10.000+ docs 
(not a memory issue: 3gb out of 8gb are still free).

I seem to run into a deadlock here: 
org.sejda.io.FileChannelSeekableSource.position(FileChannelSeekableSource.java:59)

 







[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430605#comment-16430605
 ] 

Maruan Sahyoun commented on PDFBOX-4182:


Closing the PDDocument early will also improve the ScratchFile usage.  
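
The scratch file's size limits come from the MemoryUsageSetting passed to the merge. As a quick sanity check of the two-argument setupMixed call quoted in the issue description, the sketch below works out the byte counts in plain Java. It assumes the first argument is the main-memory limit and the second the overall storage limit, both in bytes; the class and method names are made up for this illustration and no PDFBox classes are used.

```java
// Works out the byte limits written as 1024L products in the issue description.
final class MemorySettingBytes {

    // First argument in the issue: 1024L * 1024L
    static final long MAIN_MEMORY_BYTES = 1024L * 1024L;

    // Second argument in the issue: 1024L multiplied five times
    static final long STORAGE_BYTES = 1024L * 1024L * 1024L * 1024L * 1024L;

    /** Formats a byte count using binary units, truncating any remainder. */
    static String humanReadable(long bytes) {
        String[] units = {"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
        long value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < units.length - 1) {
            value /= 1024;
            unit++;
        }
        return value + " " + units[unit];
    }

    public static void main(String[] args) {
        System.out.println("main memory limit: " + humanReadable(MAIN_MEMORY_BYTES));
        System.out.println("storage limit:     " + humanReadable(STORAGE_BYTES));
    }
}
```

Read this way, the mixed setting allows only 1 MiB of buffered data in main memory and a storage ceiling of 1 PiB, so the scratch buffers themselves should have been spilling to disk almost immediately. That fits the suggestion that the heap pressure comes from documents being kept open rather than from the configured limits.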







[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Pas Filip (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430600#comment-16430600
 ] 

Pas Filip commented on PDFBOX-4182:
---

[~tilman] I think introducing the parameter can be useful to improve memory 
usage in the short term.

Ideally, reworking the scratch file may yield the biggest reduction in memory 
consumption, but that is not as easy...

 

 

> Improve memory usage of PDFMergerUtility
> 
>
> Key: PDFBOX-4182
> URL: https://issues.apache.org/jira/browse/PDFBOX-4182
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9
>Reporter: Pas Filip
>Priority: Major
> Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java, 
> Suppliers.java, 
> failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png, 
> merge-pdf-stats.xlsx, oom-2gb-heap-after-refactoring-leak-suspect-1.png, 
> oom-2gb-heap-after-refactoring-leak-suspect-2.png, successful - 
> refactored-merge-utility-4gb-heap-2618-files-merged.png, successful 
> -merge-utility-6gb-heap-2618-files-merged.png, 
> successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png, 
> successful-merge-utility-8gb-heap-2618-files-merged.png, 
> successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png
>
>
> I have been running some tests trying to merge a large number (2618) of small 
> pdf documents, between 100kb and 130kb, into a single large pdf (288.433kb).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to account for the majority of the memory usage.
> (see screenshot from mat in attachment)
> (I would include the hprof in attachment so you can analyze yourselves but 
> it's rather large)
> Note that it seems impossible to generate a large pdf using a small memory 
> footprint.
> I personally thought that using MemoryUsageSetting with temporary file only 
> would allow me to generate arbitrarily large pdf files but it doesn't seem to 
> help.
> I've run the mergeDocuments with these memory settings:
>  * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L 
> * 1024L)
>  * MemoryUsageSetting.setupTempFileOnly()
> Refactored version completes with *4GB* heap:
> with temp file only completes 2618 documents in 1.760 min
> *VS*
> *8GB* heap:
> with temp file only completes 2618 documents in 2.0 min
> Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB 
> and 8GB)
>  It looks like the loop in the mergeDocuments accumulates PDDocument objects 
> in a list which are closed after the merge is completed.
> Refactoring the code to close these as they are used, instead of accumulating 
> them and closing all at the end, improves memory usage considerably (although 
> the problem doesn't seem to be eliminated completely, based on the mat 
> analysis).
> Another change I've implemented is to only create the inputstream when the 
> file needs to be read and to close it alongside the PDDocument.
> (Some inputstreams contain buffers and, depending on the size of the buffers 
> and/or the stream type, accumulating all the streams is a potential 
> memory-hog.)
> These changes seem to be beneficial in the sense that I can 
> process the same amount of pdfs with about half the memory.
>  I'd appreciate it if you could roll these changes into the main codebase.
> (I've respected java 6 compatibility.)
> I've included in attachment the java files of the new implementation:
>  * Suppliers
>  * Supplier
>  * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. No signature 
> changes, only internal code changes. (Just rename the class to 
> PDFMergerUtility if you decide to implement the changes.)
>  In attachment you can also find some screenshots from visualvm showing the 
> memory usage of the original version and the refactored version as well as 
> some info produced by mat after analysing the heap.
> If you know of any other means, without running into memory issues, to merge 
> large sets of pdf files into a large single pdf I'd love to hear about it!
> I'd also suggest that further improvements should be made to memory 
> usage in general, as pdfbox seems to consume a lot of memory.
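
The two changes described in the report (open each input only when it is
needed, and close it before moving on to the next one) can be sketched without
PDFBox at all. The following is a minimal illustration using plain streams; the
names StreamSupplier and LazyMergeSketch are hypothetical and are not taken
from the attached Supplier.java or PDFMergerUtilityUsingSupplier.java, whose
actual contents are not shown in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lazy source; written Java 6 style (no lambdas).
interface StreamSupplier {
    InputStream get() throws IOException;
}

final class LazyMergeSketch {
    // Counters used only to show how many sources are open at the same time.
    static int openNow = 0;
    static int maxOpen = 0;

    // Wraps a byte array as a source that is opened only when get() is called.
    static StreamSupplier ofBytes(final byte[] data) {
        return new StreamSupplier() {
            public InputStream get() {
                openNow++;
                maxOpen = Math.max(maxOpen, openNow);
                return new ByteArrayInputStream(data) {
                    @Override
                    public void close() throws IOException {
                        openNow--;
                        super.close();
                    }
                };
            }
        };
    }

    // Concatenates all sources, closing each one before the next is opened.
    static byte[] merge(List<StreamSupplier> sources) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (StreamSupplier source : sources) {
            InputStream in = source.get();   // opened only when needed
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                in.close();                  // released before the next source
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<StreamSupplier> sources = new ArrayList<StreamSupplier>();
        sources.add(ofBytes(new byte[] {1, 2}));
        sources.add(ofBytes(new byte[] {3}));
        byte[] merged = merge(sources);
        System.out.println(merged.length + " bytes, max open at once: " + maxOpen);
    }
}
```

Byte concatenation is of course not a PDF merge; the sketch only shows the
resource lifetime. As the later comments in this thread point out, closing a
source document early is not always safe when the destination still refers to
objects owned by the source.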






[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-04-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430593#comment-16430593
 ] 

ASF subversion and git services commented on PDFBOX-4184:
-

Commit 1828725 from [~tilman] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1828725 ]

PDFBOX-4184: add comment

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Priority: Minor
> Fix For: 2.0.10, 3.0.0 PDFBox
>
> Attachments: pdfbox_support_16bit_image_write.patch, 
> png16-arrow-bad-no-smask.pdf, png16-arrow-bad.pdf, 
> png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf
>
>
> The attached patch adds support for writing 16 bit per component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-Bit TYPE_CUSTOM with DataType == USHORT images - but this 
> is what you usually get when you read a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as 
> the images are currently not efficiently encoded. For example, you could use 
> PNG encodings to get better compression (by adding a COSName.DECODE_PARMS 
> with a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> trade-off between speed and compression ratio. 
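
To make the image type in the description concrete, here is a small JDK-only
sketch (no PDFBox calls). It builds a 16 bit per component RGB image, checks
that it reports TYPE_CUSTOM with a USHORT data buffer, and serialises each
sample high-order byte first, which is how PDF image data is generally laid
out. The class and method names are hypothetical and this is not the code from
the attached patch.

```java
import java.awt.Transparency;
import java.awt.color.ColorSpace;
import java.awt.image.BufferedImage;
import java.awt.image.ComponentColorModel;
import java.awt.image.DataBuffer;
import java.awt.image.WritableRaster;

final class Rgb16Sketch {
    // Creates an opaque RGB image with 16 bits per component.
    static BufferedImage create16BitRgb(int width, int height) {
        ColorSpace srgb = ColorSpace.getInstance(ColorSpace.CS_sRGB);
        ComponentColorModel cm = new ComponentColorModel(
                srgb, new int[] {16, 16, 16}, false, false,
                Transparency.OPAQUE, DataBuffer.TYPE_USHORT);
        WritableRaster raster = cm.createCompatibleWritableRaster(width, height);
        return new BufferedImage(cm, raster, false, null);
    }

    // True for the kind of image the description talks about.
    static boolean is16BitCustom(BufferedImage img) {
        return img.getType() == BufferedImage.TYPE_CUSTOM
                && img.getRaster().getDataBuffer().getDataType() == DataBuffer.TYPE_USHORT;
    }

    // Writes every sample as two bytes, high-order byte first.
    static byte[] toBigEndianSamples(BufferedImage img) {
        WritableRaster raster = img.getRaster();
        int bands = raster.getNumBands();
        byte[] out = new byte[img.getWidth() * img.getHeight() * bands * 2];
        int[] pixel = new int[bands];
        int pos = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                raster.getPixel(x, y, pixel);
                for (int b = 0; b < bands; b++) {
                    out[pos++] = (byte) (pixel[b] >>> 8); // high-order byte
                    out[pos++] = (byte) pixel[b];         // low-order byte
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage img = create16BitRgb(1, 1);
        img.getRaster().setPixel(0, 0, new int[] {0x1234, 0xABCD, 0xFFFF});
        System.out.println("16 bit custom image: " + is16BitCustom(img));
        System.out.println("bytes written: " + toBigEndianSamples(img).length);
    }
}
```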






[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-04-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430592#comment-16430592
 ] 

ASF subversion and git services commented on PDFBOX-4184:
-

Commit 1828724 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1828724 ]

PDFBOX-4184: complete comment




[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-04-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430535#comment-16430535
 ] 

ASF subversion and git services commented on PDFBOX-4184:
-

Commit 1828715 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1828715 ]

PDFBOX-4184: divide 16 bit alpha values by 256




[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-04-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430536#comment-16430536
 ] 

ASF subversion and git services commented on PDFBOX-4184:
-

Commit 1828716 from [~tilman] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1828716 ]

PDFBOX-4184: divide 16 bit alpha values by 256




[jira] [Resolved] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-04-09 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun resolved PDFBOX-4158.

Resolution: Fixed

[~gary.potagal] thank you for the feedback and testing.

> COSDocument and PDFMerger may not close all IO resources if closing of one 
> fails
> 
>
> Key: PDFBOX-4158
> URL: https://issues.apache.org/jira/browse/PDFBOX-4158
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.4, 2.0.9, 3.0.0 PDFBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Minor
> Fix For: 2.0.10, 3.0.0 PDFBox
>
> Attachments: BiggestObjectAllocationGraph.png, BiggestObjectList.png, 
> PDFBOX-4158.patch
>
>
> As observed on the users mailing list, {{COSDocument.close}} and 
> {{PDFMergerUtility.mergeDocuments}} might not close all IO resources if 
> closing one of the resources fails.
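
The general pattern for the problem described in this issue is to attempt every
close and only report a failure afterwards. The following is a minimal
JDK-only sketch of that pattern; CloseAllSketch is a hypothetical name and this
is not the content of PDFBOX-4158.patch.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class CloseAllSketch {
    // Closes every resource; a failure does not stop the remaining closes.
    // The first IOException seen is rethrown once all resources were tried.
    static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException first = null;
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        final int[] closed = {0};
        List<Closeable> resources = new ArrayList<Closeable>();
        for (int i = 0; i < 3; i++) {
            final boolean fail = (i == 1);
            resources.add(new Closeable() {
                public void close() throws IOException {
                    closed[0]++;
                    if (fail) {
                        throw new IOException("close failed");
                    }
                }
            });
        }
        try {
            closeAll(resources);
        } catch (IOException e) {
            System.out.println("reported: " + e.getMessage());
        }
        System.out.println("close attempts: " + closed[0]);
    }
}
```

Rethrowing only the first exception is one possible choice; logging each
failure, or attaching later failures to the first one, would work as well.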






[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-04-09 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430483#comment-16430483
 ] 

Tilman Hausherr commented on PDFBOX-4184:
-

Yes, this could be a way to pass many options... but I wonder if we should 
change the image creation API again. For now I'd prefer to just add features to 
the existing API.

I think I understand why I wasn't able to reproduce the problem with 
self-generated files. Maybe the files had similar LSB and HSB, but your file 
had them very different so one would notice if only one byte was used.
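
A small sketch of the point about the two bytes of a 16 bit sample, and of the
"divide by 256" reduction to 8 bits mentioned in the commits on this issue.
The class name and the sample values are made up for illustration.

```java
final class SampleBytesSketch {
    // High-order byte of an unsigned 16 bit sample.
    static int highByte(int sample16) {
        return (sample16 >>> 8) & 0xFF;
    }

    // Low-order byte of an unsigned 16 bit sample.
    static int lowByte(int sample16) {
        return sample16 & 0xFF;
    }

    // Reduces an unsigned 16 bit sample to 8 bits by dividing by 256.
    static int to8Bit(int sample16) {
        return (sample16 & 0xFFFF) / 256;
    }

    public static void main(String[] args) {
        int similar = 0x8080;   // both bytes equal: keeping only one byte looks fine
        int different = 0x12F0; // bytes differ: keeping the wrong byte is visible
        System.out.printf("0x%04X -> high 0x%02X, low 0x%02X%n",
                similar, highByte(similar), lowByte(similar));
        System.out.printf("0x%04X -> high 0x%02X, low 0x%02X%n",
                different, highByte(different), lowByte(different));
        System.out.printf("0x%04X as 8 bit: 0x%02X%n", different, to8Bit(different));
    }
}
```

For 0x8080 either byte gives the same 8 bit value, so an error in byte
selection goes unnoticed; for 0x12F0 the two candidates are 0x12 and 0xF0.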

There's no way we'd use any code from old itext versions, due to the GPL 
license.





[jira] [Commented] (PDFBOX-4158) COSDocument and PDFMerger may not close all IO resources if closing of one fails

2018-04-09 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430445#comment-16430445
 ] 

Gary Potagal commented on PDFBOX-4158:
--

Yes, we completed testing Friday and are no longer seeing a memory leak / 
orphaned scratch files on disk.  This ticket can be closed.  Thank you.




[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430431#comment-16430431
 ] 

Tilman Hausherr commented on PDFBOX-4182:
-

We could add a parameter to {{mergeDocuments}}, such as {{earlyClosing}}, that 
is false in the call without that parameter, or {{lateClosing}}, that is true 
in the call without that parameter. The javadoc should explain that closing 
early can be risky in some cases, e.g. the one in PDFBOX-4004.
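
The shape of such an overload can be sketched as follows. This is not PDFBox
code: MergerFlagSketch and its string "documents" are hypothetical and only
record the order in which sources would be merged and closed, so that the
existing call keeps today's late-closing behaviour.

```java
import java.util.ArrayList;
import java.util.List;

final class MergerFlagSketch {
    // Records what happened, in order, so the two modes can be compared.
    final List<String> log = new ArrayList<String>();

    // Existing signature: behaves as before (all sources closed at the end).
    void mergeDocuments(List<String> sources) {
        mergeDocuments(sources, false);
    }

    // New overload: earlyClosing closes each source right after it is merged.
    void mergeDocuments(List<String> sources, boolean earlyClosing) {
        List<String> pending = new ArrayList<String>();
        for (String source : sources) {
            log.add("merge " + source);
            if (earlyClosing) {
                log.add("close " + source);
            } else {
                pending.add(source);
            }
        }
        for (String source : pending) {
            log.add("close " + source);
        }
    }

    public static void main(String[] args) {
        List<String> sources = new ArrayList<String>();
        sources.add("a.pdf");
        sources.add("b.pdf");

        MergerFlagSketch late = new MergerFlagSketch();
        late.mergeDocuments(sources);
        System.out.println("default: " + late.log);

        MergerFlagSketch early = new MergerFlagSketch();
        early.mergeDocuments(sources, true);
        System.out.println("early:   " + early.log);
    }
}
```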




[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430355#comment-16430355
 ] 

Maruan Sahyoun commented on PDFBOX-4182:


[~pasfilip] [~tilman] what about this approach: within PDFMergerUtility we 
develop a new merge method that is designed to close each PDDocument as soon 
as it has been merged. A flag will allow one to select between the new and the 
old merge. One needs to select the old merge to get all the current 
capabilities, but over time we add to the new merge method. After that we can 
look into further optimizations, such as a different/new/improved 
'cache'/ScratchFile, to reduce the memory consumption further if that is still 
needed. This way we keep the current implementation for the special cases in 
which (currently) the PDDocument needs to remain available, but also have a 
'slim' method for merging basic documents. WDYT?

[~pasfilip] I understand that you can't share the documents. Would it be 
possible to provide a sample set, created from scratch, which reflects the 
document set you have? Pointing to publicly available documents is also fine. 
Please keep the elements needed to the bare minimum. 


[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility

2018-04-09 Thread Pas Filip (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430333#comment-16430333
 ] 

Pas Filip commented on PDFBOX-4182:
---

[~msahyoun] I'm afraid I can't share the pdfs as they contain confidential 
information. Basically they are documents asking a customer for payment. Each 
contains an image of the EU transfer form, some text and a company logo. In 
other words, the pdfs I tested with are very simple. I will be receiving pdfs 
with hidden fields and layout instructions in production though. Most files I 
tested with were between 100kb and 140kb.

Sharing the cosstream seems problematic indeed.

Memory mapped files sound like a good idea but I'm thinking it will probably 
imply a significant rewrite of some portions of the code.

I'm not familiar enough with the code to be able to estimate if this is 
feasible...
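
For readers unfamiliar with the term, this is roughly what reading through a
memory mapped file looks like with the JDK alone. It is only an illustration of
the mechanism; MappedReadSketch is a hypothetical name and nothing here
reflects how PDFBox's scratch file works or would be changed.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class MappedReadSketch {
    // Maps the whole file read-only and sums its bytes (as unsigned values).
    // The mapped pages are managed by the operating system, not the Java heap.
    static int sumBytes(File file) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            FileChannel channel = raf.getChannel();
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            int sum = 0;
            while (buffer.hasRemaining()) {
                sum += buffer.get() & 0xFF;
            }
            return sum;
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("mapped-sketch", ".bin");
        file.deleteOnExit();
        FileOutputStream out = new FileOutputStream(file);
        try {
            out.write(new byte[] {1, 2, 3, (byte) 250});
        } finally {
            out.close();
        }
        System.out.println("sum of bytes: " + sumBytes(file));
    }
}
```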

 
