[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434982#comment-16434982
 ] 

Maruan Sahyoun commented on PDFBOX-4188:


Good timing, as I just wanted to start working on PDFBOX-4182; this will allow 
me to test whether there is some improvement. What's the idea of the patch you 
are working on? Should I wait for it?

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
> Attachments: PDFBOX-4188-breakingTest.zip
>
>
>  
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk, for the cache.  Initially, we would often get "Maximum allowed 
> scratch file memory exceeded.", unless we turned off the check by passing 
> "-1" to org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I 
> believe this is what the users who opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks the cache up into equal-sized, 
> fixed partitions based on the number of input + output files being merged.  
> This means that each partition must be big enough to hold the final output 
> file.  When 400 one-page files are merged, this creates 401 partitions, each 
> of which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401 x 
> 400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  
> This would be a very high number, usually in the gigabytes.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files (still keeping them open until the end).
> 2.  Better: Dynamically allocate in 16-page (64 KB) chunks from memory or 
> disk on demand, and release the cache as documents are closed after the 
> merge.  This is our current implementation until PDFBOX-3999, PDFBOX-4003 
> and PDFBOX-4004 are addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you
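
To make the scenario above concrete, here is a minimal sketch of such a merge 
with the PDFBox 2.0.x API, assuming the 400 one-page input files already sit on 
disk (file names and sizes are illustrative only). The maxStorageBytes value 
mirrors the ~1.2 GB figure reported later in this thread; per the report, 
passing -1 instead disables the limit.

{code:java}
import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergeSmallPdfs
{
    public static void main(String[] args) throws Exception
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        // 400 one-page input files (illustrative path pattern)
        for (int i = 0; i < 400; i++)
        {
            merger.addSource(new File("input/page-" + i + ".pdf"));
        }
        merger.setDestinationFileName("merged.pdf");

        // Mixed mode: 10 MB in main memory, the rest spills to scratch files.
        // Because the scratch space is split into one fixed partition per
        // document (400 inputs + 1 output), maxStorageBytes must cover 401
        // partitions, each big enough for the whole merged output.
        MemoryUsageSetting memSetting =
                MemoryUsageSetting.setupMixed(10L * 1024 * 1024, 1240L * 1024 * 1024);

        merger.mergeDocuments(memSetting);
    }
}
{code}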






Jenkins build is back to normal : PDFBox-sonar #438

2018-04-11 Thread Apache Jenkins Server
See 






[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434644#comment-16434644
 ] 

Gary Potagal commented on PDFBOX-4188:
--

I'm working on merging the patch that we did for 2.0.4 onto the current trunk.  
I'll try to have it available shortly for your review.

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
> Attachments: PDFBOX-4188-breakingTest.zip
>
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened PDFBOX-3721 where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until then end).
> 2.  Better: Dynamically allocate in 16 page (64K) chunks from memory or disk 
> on demand, release cache as documents are closed after merge.  This is our 
> current implementation till PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are 
> addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4071) Improve code quality (3)

2018-04-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434522#comment-16434522
 ] 

ASF subversion and git services commented on PDFBOX-4071:
-

Commit 1828933 from [~tilman] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1828933 ]

PDFBOX-4071: add option to use KCMS

> Improve code quality (3)
> 
>
> Key: PDFBOX-4071
> URL: https://issues.apache.org/jira/browse/PDFBOX-4071
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.8
>Reporter: Tilman Hausherr
>Priority: Major
> Attachments: pdfbox-screenshot-bad.png, pdfbox-screenshot-good.png
>
>
> This is a long-term issue for the task of improving code quality, using the 
> [SonarQube 
> report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
>  hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-2852, which was getting too long.






[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-04-11 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434485#comment-16434485
 ] 

Tilman Hausherr commented on PDFBOX-4184:
-

If you'd like, I'd take an improved patch against the current version of 
LosslessFactory... something that takes a new path if the image is 16 bit and 
the raster type is the one supported by your code (interleaved), i.e. a 
combination of your existing patch and the code from your comment.
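
For context, here is a minimal sketch of the call path under discussion, 
assuming a 16-bit PNG read via ImageIO (file names are illustrative); whether 
LosslessFactory keeps the 16-bit depth on this path is exactly what the patch 
is about.

{code:java}
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class Write16BitImage
{
    public static void main(String[] args) throws Exception
    {
        // 16-bit PNGs are typically read as TYPE_CUSTOM images backed by a
        // USHORT interleaved raster - the case the patch handles.
        BufferedImage image = ImageIO.read(new File("arrow-16bit.png"));

        try (PDDocument doc = new PDDocument())
        {
            PDImageXObject pdImage = LosslessFactory.createFromImage(doc, image);
            PDPage page = new PDPage();
            doc.addPage(page);
            try (PDPageContentStream cs = new PDPageContentStream(doc, page))
            {
                cs.drawImage(pdImage, 50, 50);
            }
            doc.save("png16-arrow.pdf");
        }
    }
}
{code}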

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Priority: Minor
> Fix For: 2.0.10, 3.0.0 PDFBox
>
> Attachments: pdfbox_support_16bit_image_write.patch, 
> png16-arrow-bad-no-smask.pdf, png16-arrow-bad.pdf, 
> png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf
>
>
> The attached patch adds support for writing 16-bit-per-component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT - but this 
> is what you usually get when you read a 16-bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as 
> the images are currently not efficiently encoded; e.g. you could use PNG 
> encoding to get better compression (by adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff between speed and compression ratio. 






Jenkins build is back to normal : PDFBox-Trunk-jdk9 » Apache FontBox #425

2018-04-11 Thread Apache Jenkins Server
See 






Jenkins build is back to normal : PDFBox-Trunk-jdk9 #425

2018-04-11 Thread Apache Jenkins Server
See 






[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434416#comment-16434416
 ] 

Gary Potagal commented on PDFBOX-4188:
--

[~tilman] - we created a breaking test and it's attached 
[^PDFBOX-4188-breakingTest.zip].  

The patch is binary, so you would need to apply it in the checked-out trunk 
directory using the command:

trunk> patch -p0 --binary -i PDFBOX-4188-breakingTest.diff

patching file 
pdfbox/src/test/java/org/apache/pdfbox/multipdf/PdfMergeUtilityPagesTest.java
patching file pdfbox/src/test/resources/input/merge/pages/pdf_sample_1.pdf

 

The test does the following:
 # Creates four folders containing copies of the simple one-page 
pdf_sample_1.pdf file. Each folder contains an increasing number of copies: 
100, 200, 300 and 400.  Each file is about 8 KB.
 # Merges all files in each folder.  The maxStorageBytes values in the test 
are just enough to let the test pass; if you decrease them slightly, the 
exception is thrown.  

 

Output looks like this:

Apr 11, 2018 2:25:32 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
runMergeTest
INFO: Test Name: pdf_sample_1-100pages; Files: 100; Pages: 100; Time(s): 0.781; 
Pages/Second: 128.041; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 74; 
Total Sources Size(K): 775; Merged File Size(K): 522; Ratio 
MaxStorageBytes/Merged File Size: 145
Apr 11, 2018 2:25:34 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
runMergeTest
INFO: Test Name: pdf_sample_1-200pages; Files: 200; Pages: 200; Time(s): 1.486; 
Pages/Second: 134.590; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 315; 
Total Sources Size(K): 1,551; Merged File Size(K): 1,042; Ratio 
MaxStorageBytes/Merged File Size: 309
Apr 11, 2018 2:25:37 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
runMergeTest
INFO: Test Name: pdf_sample_1-300pages; Files: 300; Pages: 300; Time(s): 3.532; 
Pages/Second: 84.938; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 710; 
Total Sources Size(K): 2,327; Merged File Size(K): 1,562; Ratio 
MaxStorageBytes/Merged File Size: 465
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
runMergeTest
INFO: Test Name: pdf_sample_1-400pages; Files: 400; Pages: 400; Time(s): 4.677; 
Pages/Second: 85.525; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1,240; 
Total Sources Size(K): 3,103; Merged File Size(K): 2,082; Ratio 
MaxStorageBytes/Merged File Size: 609
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest 
testPerformanceMerge
INFO: Summary: Pages: 1000, Time(s): 10.476, Pages/Second: 95.456

As you can see, to merge 400 one-page 8 KB files, we need to set maxStorageBytes 
to ~1.2 GB.  The resulting file is ~2,000 KB.
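
For reference, a rough sketch of what each test iteration does (class, folder 
and method names here are illustrative, not the attached test code):

{code:java}
import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergeFolderSketch
{
    // Merges every PDF in a folder; maxStorageBytes is the value under test.
    static void mergeFolder(File folder, File target, long maxStorageBytes) throws Exception
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        for (File pdf : folder.listFiles())
        {
            if (pdf.getName().endsWith(".pdf"))
            {
                merger.addSource(pdf);
            }
        }
        merger.setDestinationFileName(target.getAbsolutePath());

        long start = System.currentTimeMillis();
        merger.mergeDocuments(
                MemoryUsageSetting.setupMixed(10L * 1024 * 1024, maxStorageBytes));
        System.out.printf("%s: merged in %.3f s%n",
                folder.getName(), (System.currentTimeMillis() - start) / 1000.0);
    }
}
{code}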

 

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
> Attachments: PDFBOX-4188-breakingTest.zip
>
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened PDFBOX-3721 where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  

Jenkins build is back to normal : PDFBox-trunk » Apache Preflight #3966

2018-04-11 Thread Apache Jenkins Server
See 






Jenkins build is back to normal : PDFBox-trunk #3966

2018-04-11 Thread Apache Jenkins Server
See 






[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Attachment: PDFBOX-4188-breakingTest.zip

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
> Attachments: PDFBOX-4188-breakingTest.zip
>
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened PDFBOX-3721 where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until then end).
> 2.  Better: Dynamically allocate in 16 page (64K) chunks from memory or disk 
> on demand, release cache as documents are closed after merge.  This is our 
> current implementation till PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are 
> addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Comment Edited] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434177#comment-16434177
 ] 

Tilman Hausherr edited comment on PDFBOX-4188 at 4/11/18 5:34 PM:
--

From: Tilman Hausherr (mail addr removed) 
 Sent: Saturday, April 07, 2018 1:48 AM
 To: dev@pdfbox.apache.org
 Subject: Re: "Maximum allowed scratch file memory exceeded." Exception when 
merging large number of small PDFs

 

Hi,

 

Please have also a look at the comments in

https://issues.apache.org/jira/browse/PDFBOX-4182  

Please submit your patch proposal there or in a new issue. It should be against 
the trunk. Note that this doesn't mean your patch will be accepted, it just 
means I'd like to see it because I haven't understood your post fully, and many 
attachment types don't get through here.

 

A breaking test would be interesting: is it possible to use (or better,

create) 400 identical small PDFs and merge them and does it break?

 

Tilman

 


was (Author: gary.potagal):
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
 Sent: Saturday, April 07, 2018 1:48 AM
 To: dev@pdfbox.apache.org
 Subject: Re: "Maximum allowed scratch file memory exceeded." Exception when 
merging large number of small PDFs

 

Hi,

 

Please have also a look at the comments in

https://issues.apache.org/jira/browse/PDFBOX-4182  

Please submit your patch proposal there or in a new issue. It should be against 
the trunk. Note that this doesn't mean your patch will be accepted, it just 
means I'd like to see it because I haven't understood your post fully, and many 
attachment types don't get through here.

 

A breaking test would be interesting: is it possible to use (or better,

create) 400 identical small PDFs and merge them and does it break?

 

Tilman

 

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened PDFBOX-3721 where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until then end).
> 2.  Better: Dynamically allocate in 16 page (64K) chunks from memory or disk 
> on demand, release cache as documents are closed after merge.  This is our 
> current implementation till PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are 
> addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Comment Edited] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434177#comment-16434177
 ] 

Tilman Hausherr edited comment on PDFBOX-4188 at 4/11/18 5:33 PM:
--

From: Tilman Hausherr [mailto:thaush...@t-online.de] 
 Sent: Saturday, April 07, 2018 1:48 AM
 To: dev@pdfbox.apache.org
 Subject: Re: "Maximum allowed scratch file memory exceeded." Exception when 
merging large number of small PDFs

 

Hi,

 

Please have also a look at the comments in

https://issues.apache.org/jira/browse/PDFBOX-4182  

Please submit your patch proposal there or in a new issue. It should be against 
the trunk. Note that this doesn't mean your patch will be accepted, it just 
means I'd like to see it because I haven't understood your post fully, and many 
attachment types don't get through here.

 

A breaking test would be interesting: is it possible to use (or better,

create) 400 identical small PDFs and merge them and does it break?

 

Tilman

 


was (Author: gary.potagal):
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
 Sent: Saturday, April 07, 2018 1:48 AM
 To: dev@pdfbox.apache.org
 Subject: Re: "Maximum allowed scratch file memory exceeded." Exception when 
merging large number of small PDFs

 

Hi,

 

Please have also a look at the comments in

https://issues.apache.org/jira/browse/PDFBOX-4182  

Please submit your patch proposal there or in a new issue. It should be against 
the trunk. Note that this doesn't mean your patch will be accepted, it just 
means I'd like to see it because I haven't understood your post fully, and many 
attachment types don't get through here.

 

A breaking test would be interesting: is it possible to use (or better,

create) 400 identical small PDFs and merge them and does it break?

 

Tilman

 

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened PDFBOX-3721 where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until then end).
> 2.  Better: Dynamically allocate in 16 page (64K) chunks from memory or disk 
> on demand, release cache as documents are closed after merge.  This is our 
> current implementation till PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are 
> addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4188:

Description: 
 

On 06.04.2018 at 23:10, Gary Potagal wrote:

 

We wanted to address one more merge issue in 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).

We need to merge a large number of small files.  We use mixed mode, memory and 
disk, for the cache.  Initially, we would often get "Maximum allowed scratch file 
memory exceeded.", unless we turned off the check by passing "-1" to 
org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe this is 
what the users who opened PDFBOX-3721 were running into.

Our research indicates that the core issue with the memory model is that 
instead of sharing a single cache, it breaks the cache up into equal-sized, 
fixed partitions based on the number of input + output files being merged.  This 
means that each partition must be big enough to hold the final output file.  
When 400 one-page files are merged, this creates 401 partitions, each of 
which needs to be big enough to hold the final 400 pages.  Even worse, the 
merge algorithm needs to keep all files open until the end.

Given this, near the end of the merge, we're actually caching 400 x 1-page 
input files and 1 x 400-page output file, or 801 pages.

However, with the partitioned cache, we need to declare room for 401 x 
400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
would be a very high number, usually in the gigabytes.

 

Given the current limitation that we need to keep all the input files open 
until the output file is written (HUGE), we came up with 2 options.  (See 
PDFBOX-4182)  

 

1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
other ½ across the input files (still keeping them open until the end).

2.  Better: Dynamically allocate in 16-page (64 KB) chunks from memory or disk on 
demand, and release the cache as documents are closed after the merge.  This is our 
current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are addressed.
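
Purely as an illustration of option 2 above (a hypothetical sketch, not the 
actual patch or the org.apache.pdfbox.io internals), an allocator that hands 
out fixed 64 KB chunks on demand and returns them to a shared free list when a 
source document is closed could look roughly like this:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the "allocate 64 KB chunks on demand" idea.
final class ChunkPool
{
    static final int CHUNK_SIZE = 64 * 1024; // 16 scratch pages of 4 KB each

    private final Deque<byte[]> freeChunks = new ArrayDeque<byte[]>();

    // Reuse a released chunk if one is available, otherwise allocate a new one.
    synchronized byte[] acquire()
    {
        byte[] chunk = freeChunks.poll();
        return chunk != null ? chunk : new byte[CHUNK_SIZE];
    }

    // Called when a merged source document is closed: its chunks become
    // available again instead of staying reserved in a fixed partition.
    synchronized void release(byte[] chunk)
    {
        freeChunks.push(chunk);
    }
}
{code}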

 

We would like to submit our current implementation as a Patch to 2.0.10 and 
3.0.0, unless this is already addressed.

 

 Thank you

  was:
 

On 06.04.2018 at 23:10, Gary Potagal wrote:

 

We wanted to address one more merge issue in 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).

We need to merge a large number of small files.  We use mixed mode, memory and 
disk, for the cache.  Initially, we would often get "Maximum allowed scratch file 
memory exceeded.", unless we turned off the check by passing "-1" to 
org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe this is 
what the users who opened

https://issues.apache.org/jira/browse/PDFBOX-3721 

were running into.

Our research indicates that the core issue with the memory model is that 
instead of sharing a single cache, it breaks the cache up into equal-sized, 
fixed partitions based on the number of input + output files being merged.  This 
means that each partition must be big enough to hold the final output file.  
When 400 one-page files are merged, this creates 401 partitions, each of 
which needs to be big enough to hold the final 400 pages.  Even worse, the 
merge algorithm needs to keep all files open until the end.

Given this, near the end of the merge, we're actually caching 400 x 1-page 
input files and 1 x 400-page output file, or 801 pages.

However, with the partitioned cache, we need to declare room for 401 x 
400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
would be a very high number, usually in the gigabytes.

 

Given the current limitation that we need to keep all the input files open 
until the output file is written (HUGE), we came up with 2 options.  See 
(https://issues.apache.org/jira/browse/PDFBOX-4182)  

 

1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
other ½ across the input files (still keeping them open until the end).

2.  Better: Dynamically allocate in 16-page (64 KB) chunks from memory or disk on 
demand, and release the cache as documents are closed after the merge.  This is our 
current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are addressed.

 

We would like to submit our current implementation as a Patch to 2.0.10 and 
3.0.0, unless this is already addressed.

 

 Thank you


>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary 

[jira] [Commented] (PDFBOX-4187) Refactor LosslessFactory alpha

2018-04-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434272#comment-16434272
 ] 

ASF subversion and git services commented on PDFBOX-4187:
-

Commit 1828915 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1828915 ]

PDFBOX-4187: simplify code

> Refactor LosslessFactory alpha
> --
>
> Key: PDFBOX-4187
> URL: https://issues.apache.org/jira/browse/PDFBOX-4187
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.0.10, 3.0.0 PDFBox
>
>
> While looking into the code for PDFBOX-4184 I noticed that we try to get the 
> alpha data in different ways, even though it is available in the main method 
> when {{image.getRGB()}} is called. So I'm refactoring all this; as a side 
> effect, my 16-bit change in PDFBOX-4184 is no longer needed.
> I'll commit this in two steps: 1) change the main method and remove the 
> ones that are no longer used; 2) split the main method into a gray and a 
> color code path.
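
As a small illustration of the point above (not the actual refactoring), the 
alpha channel is already contained in the packed ARGB values that 
{{BufferedImage.getRGB()}} returns, so it can be extracted in the same pass:

{code:java}
import java.awt.image.BufferedImage;

public class AlphaFromGetRGB
{
    // Extracts the alpha plane from the packed ARGB ints returned by getRGB.
    static byte[] extractAlpha(BufferedImage image)
    {
        int w = image.getWidth();
        int h = image.getHeight();
        int[] argb = image.getRGB(0, 0, w, h, null, 0, w);
        byte[] alpha = new byte[w * h];
        for (int i = 0; i < argb.length; i++)
        {
            alpha[i] = (byte) (argb[i] >>> 24); // top byte is the alpha value
        }
        return alpha;
    }
}
{code}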






[jira] [Commented] (PDFBOX-4187) Refactor LosslessFactory alpha

2018-04-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434273#comment-16434273
 ] 

ASF subversion and git services commented on PDFBOX-4187:
-

Commit 1828916 from [~tilman] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1828916 ]

PDFBOX-4187: simplify code




[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434191#comment-16434191
 ] 

Gary Potagal commented on PDFBOX-4188:
--

We don't know what PDFs we're going to get, so we are trying to make this generic.

 

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened
> https://issues.apache.org/jira/browse/PDFBOX-3721 
> where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  See 
> (https://issues.apache.org/jira/browse/PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until then end).
> 2.  Better: Dynamically allocate in 16 page (64K) chunks from memory or disk 
> on demand, release cache as documents are closed after merge.  This is our 
> current implementation till PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004.are 
> addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434188#comment-16434188
 ] 

Maruan Sahyoun commented on PDFBOX-4188:


Do the documents you are merging use the elements described in PDFBOX-3999 and 
PDFBOX-4003?

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened
> https://issues.apache.org/jira/browse/PDFBOX-3721 
> where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  See 
> (https://issues.apache.org/jira/browse/PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until then end).
> 2.  Better: Dynamically allocate in 16 page (64K) chunks from memory or disk 
> on demand, release cache as documents are closed after merge.  This is our 
> current implementation till PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004.are 
> addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Description: 
 

On 06.04.2018 at 23:10, Gary Potagal wrote:

 

We wanted to address one more merge issue in 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).

We need to merge a large number of small files.  We use mixed mode, memory and 
disk, for the cache.  Initially, we would often get "Maximum allowed scratch file 
memory exceeded.", unless we turned off the check by passing "-1" to 
org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe this is 
what the users who opened

https://issues.apache.org/jira/browse/PDFBOX-3721 

were running into.

Our research indicates that the core issue with the memory model is that 
instead of sharing a single cache, it breaks the cache up into equal-sized, 
fixed partitions based on the number of input + output files being merged.  This 
means that each partition must be big enough to hold the final output file.  
When 400 one-page files are merged, this creates 401 partitions, each of 
which needs to be big enough to hold the final 400 pages.  Even worse, the 
merge algorithm needs to keep all files open until the end.

Given this, near the end of the merge, we're actually caching 400 x 1-page 
input files and 1 x 400-page output file, or 801 pages.

However, with the partitioned cache, we need to declare room for 401 x 
400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
would be a very high number, usually in the gigabytes.

 

Given the current limitation that we need to keep all the input files open 
until the output file is written (HUGE), we came up with 2 options.  See 
(https://issues.apache.org/jira/browse/PDFBOX-4182)  

 

1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
other ½ across the input files (still keeping them open until the end).

2.  Better: Dynamically allocate in 16-page (64 KB) chunks from memory or disk on 
demand, and release the cache as documents are closed after the merge.  This is our 
current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are addressed.

 

We would like to submit our current implementation as a Patch to 2.0.10 and 
3.0.0, unless this is already addressed.

 

 Thank you

  was:
 

On 06.04.2018 at 23:10, Gary Potagal wrote:

 

We wanted to address one more merge issue in 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).

We need to merge a large number of small files.  We use mixed mode, memory and 
disk, for the cache.  Initially, we would often get "Maximum allowed scratch file 
memory exceeded.", unless we turned off the check by passing "-1" to 
org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe this is 
what the users who opened

https://issues.apache.org/jira/browse/PDFBOX-3721 

were running into.

Our research indicates that the core issue with the memory model is that 
instead of sharing a single cache, it breaks the cache up into equal-sized, 
fixed partitions based on the number of input + output files being merged.  This 
means that each partition must be big enough to hold the final output file.  
When 400 one-page files are merged, this creates 401 partitions, each of 
which needs to be big enough to hold the final 400 pages.  Even worse, the 
merge algorithm needs to keep all files open until the end.

Given this, near the end of the merge, we're actually caching 400 x 1-page 
input files and 1 x 400-page output file, or 801 pages.

However, with the partitioned cache, we need to declare room for 401 x 
400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
would be a very high number, usually in the gigabytes.

 

Given the current limitation that we need to keep all the input files open 
until the output file is written (HUGE), we came up with 2 options.  See 
(https://issues.apache.org/jira/browse/PDFBOX-4182)  

 

1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
other ½ across the input files (still keeping them open until the end).

2.  Better: Dynamically allocate in 16-page (64 KB) chunks from memory or disk on 
demand, and release the cache as documents are closed after the merge.  This is our 
current implementation until PDFBOX-3999 is addressed.

 

We would like to submit our current implementation as a Patch to 2.0.10 and 
3.0.0, unless this is already addressed.

 

 Thank you


>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 

[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434177#comment-16434177
 ] 

Gary Potagal commented on PDFBOX-4188:
--

From: Tilman Hausherr [mailto:thaush...@t-online.de] 
 Sent: Saturday, April 07, 2018 1:48 AM
 To: dev@pdfbox.apache.org
 Subject: Re: "Maximum allowed scratch file memory exceeded." Exception when 
merging large number of small PDFs

 

Hi,

 

Please have also a look at the comments in

https://issues.apache.org/jira/browse/PDFBOX-4182  

Please submit your patch proposal there or in a new issue. It should be against 
the trunk. Note that this doesn't mean your patch will be accepted, it just 
means I'd like to see it because I haven't understood your post fully, and many 
attachment types don't get through here.

 

A breaking test would be interesting: is it possible to use (or better,

create) 400 identical small PDFs and merge them and does it break?

 

Tilman

 

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
>
>  
> Am 06.04.2018 um 23:10 schrieb Gary Potagal:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, memory 
> and disk for cache.  Initially, we would often get "Maximum allowed scratch 
> file memory exceeded.", unless we turned off the check by passing "-1" to 
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe, this 
> is what the users that opened
> https://issues.apache.org/jira/browse/PDFBOX-3721 
> where running into.
> Our research indicates that the core issue with the memory model is that 
> instead of sharing a single cache, it breaks it up into equal sized fixed 
> partitions based on the number of input + output files being merged.  This 
> means that each partition must be big enough to hold the final output file.  
> When 400 1-page files are merged, this creates 401 partitions, but each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files, and 1 x 400-page output file, or 801 pages.
> However, with the partitioned cache, we need to declare room for 401  x 
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes".  This 
> would be a very high number, usually in GIGs.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  See 
> (https://issues.apache.org/jira/browse/PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until then end).
> 2.  Better: Dynamically allocate in 16 page (64K) chunks from memory or disk 
> on demand, release cache as documents are closed after merge.  This is our 
> current implementation till PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004.are 
> addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Description: 
 

On 06.04.2018 at 23:10, Gary Potagal wrote:

 

We wanted to address one more merge issue in 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).

We need to merge a large number of small files.  We use mixed mode, memory and 
disk, for the cache.  Initially, we would often get "Maximum allowed scratch file 
memory exceeded.", unless we turned off the check by passing "-1" to 
org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe this is 
what the users who opened

https://issues.apache.org/jira/browse/PDFBOX-3721 

were running into.

Our research indicates that the core issue with the memory model is that 
instead of sharing a single cache, it breaks the cache up into equal-sized, 
fixed partitions based on the number of input + output files being merged.  This 
means that each partition must be big enough to hold the final output file.  
When 400 one-page files are merged, this creates 401 partitions, each of 
which needs to be big enough to hold the final 400 pages.  Even worse, the 
merge algorithm needs to keep all files open until the end.

Given this, near the end of the merge, we're actually caching 400 x 1-page 
input files and 1 x 400-page output file, or 801 pages.

However, with the partitioned cache, we need to declare room for 401 x 
400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
would be a very high number, usually in the gigabytes.

 

Given the current limitation that we need to keep all the input files open 
until the output file is written (HUGE), we came up with 2 options.  See 
(https://issues.apache.org/jira/browse/PDFBOX-4182)  

 

1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
other ½ across the input files (still keeping them open until the end).

2.  Better: Dynamically allocate in 16-page (64 KB) chunks from memory or disk on 
demand, and release the cache as documents are closed after the merge.  This is our 
current implementation until PDFBOX-3999 is addressed.

 

We would like to submit our current implementation as a Patch to 2.0.10 and 
3.0.0, unless this is already addressed.

 

 Thank you

  was:
I have been running some tests trying to merge large numbers (2618) of small 
PDF documents, between 100 KB and 130 KB, into a single large PDF (288,433 KB).

Memory consumption seems to be the main limitation.

ScratchFileBuffer seems to consume the majority of the memory usage.

(see screenshot from mat in attachment)

(I would include the hprof in attachment so you can analyze yourselves but it's 
rather large)

Note that it seems impossible to generate a large pdf using a small memory 
footprint.

I personally thought that using MemorySettings with temporary file only would 
allow me to generate arbitrarily large pdf files but it doesn't seem to help.

I've run the mergeDocuments with  memory settings:
 * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 
1024L)

 * MemoryUsageSetting.setupTempFileOnly()

Refactored version completes with *4GB* heap:

with temp file only completes 2618 documents in 1.760 min

*VS*

*8GB* heap:

with temp file only completes 2618 documents in 2.0 min

Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB and 
8GB)

 It looks like the loop in mergeDocuments accumulates PDDocument objects in 
a list, which are closed only after the merge is completed.

Refactoring the code to close these as they are used, instead of accumulating 
them and closing them all at the end, improves memory usage considerably 
(although the problem doesn't seem to be eliminated completely, based on MAT analysis).

Another change I've implemented is to only create the inputstream when the file 
needs to be read and to close it alongside the PDDocument.

(Some inputstreams contain buffers and depending on the size of the buffers and 
or the stream type accumulating all the streams is a potential memory-hog.)

These changes seem to be a beneficial improvement in the sense that I can 
process the same number of PDFs with about half the memory.

 I'd appreciate it if you could roll these changes into the main codebase.

(I've respected java 6 compatibility.)

I've included in attachment the java files of the new implementation:
 * Suppliers
 * Supplier
 * PDFMergerUtilityUsingSupplier

PDFMergerUtilityUsingSupplier can replace the previous version. No signature 
changes, only internal code changes (just rename the class to PDFMergerUtility 
if you decide to implement the changes).

 In attachment you can also find some screenshots from visualvm showing the 
memory usage of the original version and the refactored version as well as some 
info produced by mat after analysing the heap.

If you know of any other means, without 

[jira] [Updated] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
-
Affects Version/s: 3.0.0 PDFBox

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.9, 3.0.0 PDFBox
>Reporter: Gary Potagal
>Priority: Major
>
> I have been running some tests trying to merge a large number (2618) of small 
> PDF documents, between 100 KB and 130 KB, into a single large PDF (288,433 KB).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to consume the majority of the memory usage.
> (see screenshot from mat in attachment)
> (I would include the hprof in attachment so you can analyze yourselves but 
> it's rather large)
> Note that it seems impossible to generate a large pdf using a small memory 
> footprint.
> I personally thought that using MemorySettings with temporary file only would 
> allow me to generate arbitrarily large pdf files but it doesn't seem to help.
> I've run mergeDocuments with these memory settings:
>  * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L 
> * 1024L)
>  * MemoryUsageSetting.setupTempFileOnly()
> Refactored version completes with *4GB* heap:
> with temp file only completes 2618 documents in 1.760 min
> *VS*
> *8GB* heap:
> with temp file only completes 2618 documents in 2.0 min
> Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB 
> and 8GB)
> It looks like the loop in mergeDocuments accumulates PDDocument objects 
> in a list, and these are only closed after the merge is completed.
> Refactoring the code to close these as they are used, instead of accumulating 
> them and closing them all at the end, improves memory usage considerably (although, 
> based on the MAT analysis, the problem doesn't seem to be eliminated completely).
> Another change I've implemented is to only create the InputStream when the 
> file needs to be read, and to close it alongside the PDDocument.
> (Some InputStreams contain buffers, and depending on the size of the buffers 
> and/or the stream type, accumulating all the streams is a potential 
> memory hog.)
> These changes seem to be a real improvement, in the sense that I can 
> process the same number of PDFs with about half the memory.
>  I'd appreciate it if you could roll these changes into the main codebase.
> (I've respected java 6 compatibility.)
> I've included in attachment the java files of the new implementation:
>  * Suppliers
>  * Supplier
>  * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. There are no 
> signature changes, only internal code changes. (Just rename the class to 
> PDFMergerUtility if you decide to implement the changes.)
> In the attachment you can also find some screenshots from VisualVM showing the 
> memory usage of the original version and the refactored version, as well as 
> some info produced by MAT after analysing the heap.
> If you know of any other means, without running into memory issues, to merge 
> large sets of PDF files into a single large PDF, I'd love to hear about it!
> I'd also suggest that there should be further improvements made in memory 
> usage in general, as PDFBox seems to consume a lot of memory in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Created] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

2018-04-11 Thread Gary Potagal (JIRA)
Gary Potagal created PDFBOX-4188:


 Summary:  "Maximum allowed scratch file memory exceeded." 
Exception when merging large number of small PDFs
 Key: PDFBOX-4188
 URL: https://issues.apache.org/jira/browse/PDFBOX-4188
 Project: PDFBox
  Issue Type: Improvement
Affects Versions: 2.0.9
Reporter: Gary Potagal


I have been running some tests trying to merge a large number (2618) of small 
PDF documents, between 100 KB and 130 KB, into a single large PDF (288,433 KB).

Memory consumption seems to be the main limitation.

ScratchFileBuffer seems to consume the majority of the memory usage.

(see screenshot from mat in attachment)

(I would include the hprof in attachment so you can analyze yourselves but it's 
rather large)

Note that it seems impossible to generate a large pdf using a small memory 
footprint.

I personally thought that using MemorySettings with temporary file only would 
allow me to generate arbitrarily large pdf files but it doesn't seem to help.

I've run mergeDocuments with these memory settings:
 * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 
1024L)

 * MemoryUsageSetting.setupTempFileOnly()

Refactored version completes with *4GB* heap:

with temp file only completes 2618 documents in 1.760 min

*VS*

*8GB* heap:

with temp file only completes 2618 documents in 2.0 min

Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB and 
8GB)

It looks like the loop in mergeDocuments accumulates PDDocument objects in 
a list, and these are only closed after the merge is completed.

Refactoring the code to close these as they are used, instead of accumulating 
them and closing them all at the end, improves memory usage considerably (although, 
based on the MAT analysis, the problem doesn't seem to be eliminated completely).

Another change I've implemented is to only create the InputStream when the file 
needs to be read, and to close it alongside the PDDocument.

(Some InputStreams contain buffers, and depending on the size of the buffers 
and/or the stream type, accumulating all the streams is a potential memory hog.)

These changes seem to be a real improvement, in the sense that I can 
process the same number of PDFs with about half the memory.

 I'd appreciate it if you could roll these changes into the main codebase.

(I've respected java 6 compatibility.)

I've included in attachment the java files of the new implementation:
 * Suppliers
 * Supplier
 * PDFMergerUtilityUsingSupplier

PDFMergerUtilityUsingSupplier can replace the previous version. There are no signature 
changes, only internal code changes. (Just rename the class to PDFMergerUtility 
if you decide to implement the changes.)

In the attachment you can also find some screenshots from VisualVM showing the 
memory usage of the original version and the refactored version, as well as some 
info produced by MAT after analysing the heap.

If you know of any other means, without running into memory issues, to merge 
large sets of PDF files into a single large PDF, I'd love to hear about it!

I'd also suggest that there should be further improvements made in memory usage 
in general, as PDFBox seems to consume a lot of memory in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-3631) Signature interoperability issue / visible signature not visible on some viewers

2018-04-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433853#comment-16433853
 ] 

Michaël Krens commented on PDFBOX-3631:
---

Same here, it looks like this fixed the issues. Thanks!

> Signature interoperability issue / visible signature not visible on some 
> viewers
> 
>
> Key: PDFBOX-3631
> URL: https://issues.apache.org/jira/browse/PDFBOX-3631
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Affects Versions: 2.0.3
> Environment: Java 1.8 Windows.
>Reporter: Marco Monacelli
>Priority: Major
> Attachments: Microsoft_Edge_on_Windows_10_rendering_correct.png, 
> PDFBOX-3631.zip, PDFJS-4743-signature.pdf, Preview_rendering_not_correct.png, 
> acrobat_rendering_correct.png, chrome_rendering_incorrect.png, 
> firefox_incorrect_but_expected.png, itext-doc.pdf, itext-doc_signed-bad.pdf, 
> itext-doc_signed-good.pdf, libreoffice_rendering_correct.png, 
> not_working_pdf_with_signatures.pdf, 
> preview_of_Preview_rendering_correct.png, safari_using_preview_incorrect.png, 
> test_out_pdf.pdf, working_pdf_with_signatures.pdf, 
> working_pdf_with_signatures.pdf
>
>
> Some files, if signed with PDFBox, produce a signature that is not visible in 
> Chrome, PDFium and Foxit.
> If the same file is signed with Acrobat, Foxit or iText, the signature is 
> visible.
> The test files are inserted in an encrypted zip. If possible I would like to 
> communicate the password in a private message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Build failed in Jenkins: PDFBox-trunk » Apache Preflight #3965

2018-04-11 Thread Apache Jenkins Server
See 


--
[...truncated 83.04 KB...]
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'Arial-BoldItalicMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'CourierNewPSMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'ArialMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'CourierNewPS-BoldMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldItalicMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'Arial-BoldMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'Arial-ItalicMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'Arial-BoldItalicMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'CourierNewPSMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'ArialMT'
2018-04-11 09:20:26 WARN  org.apache.pdfbox.pdmodel.font.PDTrueTypeFont:131 - 
Using fallback font 'LiberationSans' for 'CourierNewPS-BoldMT'
  pardes14_Jid02_reduced.pdf
  stat_dis_30_fixed.pdf
  hopf1971.pdf
  Pardes13_Art02.pdf
  modules_acrobat9.pdf
  PDFA_Conference_2009_nc.pdf
2018-04-11 09:20:35 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:36 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:37 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:38 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:38 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:39 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:40 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:41 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:42 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:43 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:43 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:44 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:45 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:46 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:47 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:48 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:48 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:49 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:50 WARN  
org.apache.pdfbox.pdmodel.graphics.color.PDICCBased:206 - ICC profile is 
Perceptual, ignoring, treating as Display class
2018-04-11 09:20:51 WARN  

Build failed in Jenkins: PDFBox-trunk #3965

2018-04-11 Thread Apache Jenkins Server
See 


Changes:

[msahyoun] PDFBOX-3809: support flatten for specific fields only; current 
limitation is that the widget annotation must have a page reference

--
[...truncated 196.23 KB...]
[INFO] 
[INFO] --- maven-bundle-plugin:3.5.0:bundle (default-bundle) @ preflight ---
[INFO] 
[INFO] --- maven-site-plugin:3.7:attach-descriptor (attach-descriptor) @ 
preflight ---
[INFO] Skipping because packaging 'bundle' is not pom.
[INFO] 
[INFO] >>> maven-source-plugin:3.0.1:jar (attach-sources) > generate-sources @ 
preflight >>>
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-maven-version) @ 
preflight ---
[WARNING] Failed to getClass for org.apache.maven.plugins.source.SourceJarMojo
[INFO] 
[INFO] <<< maven-source-plugin:3.0.1:jar (attach-sources) < generate-sources @ 
preflight <<<
[INFO] 
[INFO] 
[INFO] --- maven-source-plugin:3.0.1:jar (attach-sources) @ preflight ---
[INFO] Building jar: 

[INFO] 
[INFO] --- maven-surefire-plugin:2.20.1:test (surefire-itest) @ preflight ---
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running org.apache.pdfbox.preflight.integration.TestIsartorValidation
2018-04-11 09:21:08 WARN  
org.apache.pdfbox.preflight.integration.TestIsartorValidation:86 - 
'expected.errors' does not reference valid file, so cannot execute tests : 

2018-04-11 09:21:08 WARN  
org.apache.pdfbox.preflight.integration.AbstractInvalidFileTester:86 - This is 
an empty test
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.385 s 
- in org.apache.pdfbox.preflight.integration.TestIsartorValidation
[INFO] Running org.apache.pdfbox.preflight.integration.TestValidFiles
2018-04-11 09:21:08 WARN  
org.apache.pdfbox.preflight.integration.TestValidFiles:84 - valid.files (where 
are isartor pdf files) is not defined.
No result file defined, will use standard error
2018-04-11 09:21:08 WARN  
org.apache.pdfbox.preflight.integration.TestValidFiles:129 - This is an empty 
test
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 s - 
in org.apache.pdfbox.preflight.integration.TestValidFiles
[INFO] Running org.apache.pdfbox.preflight.integration.TestInvalidFiles
2018-04-11 09:21:08 WARN  
org.apache.pdfbox.preflight.integration.TestInvalidFiles:88 - 'expected.errors' 
does not reference valid file, so cannot execute tests : 

2018-04-11 09:21:08 WARN  
org.apache.pdfbox.preflight.integration.AbstractInvalidFileTester:86 - This is 
an empty test
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 s 
- in org.apache.pdfbox.preflight.integration.TestInvalidFiles
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[JENKINS] Recording test results
[INFO] 
[INFO] --- apache-rat-plugin:0.12:check (default) @ preflight ---
[INFO] Enabled default license matchers.
[INFO] Will parse SCM ignores for exclusions...
[INFO] Finished adding exclusions from SCM ignore files.
[INFO] 61 implicit excludes (use -debug for more details).
[INFO] Exclude: src/main/resources/project.version
[INFO] Exclude: release.properties
[INFO] 155 resources included (use -debug for more details)
[INFO] Rat check: Summary over all files. Unapproved: 0, unknown: 0, generated: 
0, approved: 145 licenses.
[INFO] 
[INFO] --- dependency-check-maven:3.1.2:check (default) @ preflight ---
[INFO] Checking for updates
[INFO] Skipping NVD check since last check was within 4 hours.
[INFO] Check for updates complete (16 ms)
[INFO] Analysis Started
[INFO] Finished Archive Analyzer (0 seconds)
[INFO] Finished File Name Analyzer (0 seconds)
[INFO] Finished Jar Analyzer (0 seconds)
[ERROR] Could not connect to Central search. Analysis failed.
[ERROR] Could not connect to Central search. Analysis failed.
java.io.IOException: Finally failed connecting to Central search. Giving up 
after 5 tries.
at 
org.owasp.dependencycheck.analyzer.CentralAnalyzer.fetchMavenArtifacts(CentralAnalyzer.java:288)
at 
org.owasp.dependencycheck.analyzer.CentralAnalyzer.analyzeDependency(CentralAnalyzer.java:198)
at 
org.owasp.dependencycheck.analyzer.AbstractAnalyzer.analyze(AbstractAnalyzer.java:136)
at org.owasp.dependencycheck.AnalysisTask.call(AnalysisTask.java:88)
at org.owasp.dependencycheck.AnalysisTask.call(AnalysisTask.java:37)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

Build failed in Jenkins: PDFBox-Trunk-jdk9 #424

2018-04-11 Thread Apache Jenkins Server
See 


Changes:

[msahyoun] PDFBOX-3809: support flatten for specific fields only; current 
limitation is that the widget annotation must have a page reference

--
[...truncated 7.13 KB...]
[INFO] 
[INFO] 
[INFO] --- maven-source-plugin:3.0.1:jar (attach-sources) @ pdfbox-parent ---
[INFO] 
[INFO] --- apache-rat-plugin:0.12:check (default) @ pdfbox-parent ---
[INFO] Enabled default license matchers.
[INFO] Will parse SCM ignores for exclusions...
[INFO] Finished adding exclusions from SCM ignore files.
[INFO] 61 implicit excludes (use -debug for more details).
[INFO] Exclude: release.properties
[INFO] 1 resources included (use -debug for more details)
[INFO] Rat check: Summary over all files. Unapproved: 0, unknown: 0, generated: 
0, approved: 1 licenses.
[INFO] 
[INFO] --- dependency-check-maven:3.1.2:check (default) @ pdfbox-parent ---
[INFO] Checking for updates
[INFO] starting getUpdatesNeeded() ...
[INFO] Download Started for NVD CVE - Modified
[INFO] Download Complete for NVD CVE - Modified  (1894 ms)
[INFO] Processing Started for NVD CVE - Modified
[INFO] Processing Complete for NVD CVE - Modified  (10818 ms)
[INFO] Begin database maintenance.
[INFO] End database maintenance.
[INFO] Check for updates complete (80472 ms)
[INFO] Analysis Started
[INFO] Finished File Name Analyzer (0 seconds)
[INFO] Finished Dependency Merging Analyzer (0 seconds)
[INFO] Finished Version Filter Analyzer (0 seconds)
[INFO] Finished Hint Analyzer (0 seconds)
[INFO] Created CPE Index (4 seconds)
[INFO] Skipping CPE Analysis for npm
[INFO] Finished CPE Analyzer (4 seconds)
[INFO] Finished False Positive Analyzer (0 seconds)
[INFO] Finished NVD CVE Analyzer (0 seconds)
[INFO] Finished Vulnerability Suppression Analyzer (0 seconds)
[INFO] Finished Dependency Bundling Analyzer (0 seconds)
[INFO] Analysis Complete (5 seconds)
[INFO] 
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ pdfbox-parent 
---
[INFO] Installing 
 to 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/pdfbox/pdfbox-parent/3.0.0-SNAPSHOT/pdfbox-parent-3.0.0-SNAPSHOT.pom
[INFO] 
[INFO] 
[INFO] Building Apache FontBox 3.0.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:3.0.0:clean (default-clean) @ fontbox ---
[TASKS] Scanning folder 
' for files 
matching the pattern '**/*.java' - excludes: 
[TASKS] Found 111 files to scan for tasks
Found 14 open tasks.
[TASKS] Computing warning deltas based on reference build #423
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-maven-version) @ 
fontbox ---
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (process-resource-bundles) 
@ fontbox ---
[INFO] 
[INFO] --- maven-resources-plugin:3.0.2:resources (default-resources) @ fontbox 
---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 92 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.6.0:compile (default-compile) @ fontbox ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 103 source files to 

[WARNING] bootstrap class path not set in conjunction with -source 1.7
[INFO] 
:
 Some input files use unchecked or unsafe operations.
[INFO] 
:
 Recompile with -Xlint:unchecked for details.
[INFO] 
[INFO] --- download-maven-plugin:1.3.0:wget (PDFBOX-4038) @ fontbox ---
[INFO] Got from cache: 
/home/jenkins/jenkins-slave/maven-repositories/0/.cache/download-maven-plugin/SourceSansProBold.otf_ee42692fed82c908ab7e9c43b25ddd47
[INFO] 
[INFO] --- download-maven-plugin:1.3.0:wget (PDFBOX-3997) @ fontbox ---
[INFO] Got from cache: 
/home/jenkins/jenkins-slave/maven-repositories/0/.cache/download-maven-plugin/NotoEmoji-Regular.ttf_9784d0bb7855c7246244cb1d8f77b7c5
[INFO] 
[INFO] --- download-maven-plugin:1.3.0:wget (PDFBOX-3379) @ fontbox ---
[INFO] Got from cache: 
/home/jenkins/jenkins-slave/maven-repositories/0/.cache/download-maven-plugin/DejaVuSansMono.ttf_a2ebcaa160f7aaca8aa6f8360edf5cda
[INFO] 
[INFO] --- maven-resources-plugin:3.0.2:testResources (default-testResources) @ 
fontbox ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.6.0:testCompile (default-testCompile) @ 
fontbox ---
[INFO] Changes detected - recompiling the 

Build failed in Jenkins: PDFBox-Trunk-jdk9 » Apache FontBox #424

2018-04-11 Thread Apache Jenkins Server
See 


--
[INFO] 
[INFO] 
[INFO] Building Apache FontBox 3.0.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:3.0.0:clean (default-clean) @ fontbox ---
[TASKS] Scanning folder 
'
 for files matching the pattern '**/*.java' - excludes: 
[TASKS] Found 111 files to scan for tasks
Found 14 open tasks.
[TASKS] Computing warning deltas based on reference build #423
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-maven-version) @ 
fontbox ---
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (process-resource-bundles) 
@ fontbox ---
[INFO] 
[INFO] --- maven-resources-plugin:3.0.2:resources (default-resources) @ fontbox 
---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 92 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.6.0:compile (default-compile) @ fontbox ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 103 source files to 

[WARNING] bootstrap class path not set in conjunction with -source 1.7
[INFO] 
:
 Some input files use unchecked or unsafe operations.
[INFO] 
:
 Recompile with -Xlint:unchecked for details.
[INFO] 
[INFO] --- download-maven-plugin:1.3.0:wget (PDFBOX-4038) @ fontbox ---
[INFO] Got from cache: 
/home/jenkins/jenkins-slave/maven-repositories/0/.cache/download-maven-plugin/SourceSansProBold.otf_ee42692fed82c908ab7e9c43b25ddd47
[INFO] 
[INFO] --- download-maven-plugin:1.3.0:wget (PDFBOX-3997) @ fontbox ---
[INFO] Got from cache: 
/home/jenkins/jenkins-slave/maven-repositories/0/.cache/download-maven-plugin/NotoEmoji-Regular.ttf_9784d0bb7855c7246244cb1d8f77b7c5
[INFO] 
[INFO] --- download-maven-plugin:1.3.0:wget (PDFBOX-3379) @ fontbox ---
[INFO] Got from cache: 
/home/jenkins/jenkins-slave/maven-repositories/0/.cache/download-maven-plugin/DejaVuSansMono.ttf_a2ebcaa160f7aaca8aa6f8360edf5cda
[INFO] 
[INFO] --- maven-resources-plugin:3.0.2:testResources (default-testResources) @ 
fontbox ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.6.0:testCompile (default-testCompile) @ 
fontbox ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 8 source files to 

[WARNING] bootstrap class path not set in conjunction with -source 1.7
[INFO] 
:
 

 uses unchecked or unsafe operations.
[INFO] 
:
 Recompile with -Xlint:unchecked for details.
[INFO] 
[INFO] --- maven-surefire-plugin:2.20.1:test (default-test) @ fontbox ---
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running org.apache.fontbox.cff.CFFParserTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 s 
- in org.apache.fontbox.cff.CFFParserTest
[INFO] Running org.apache.fontbox.cff.Type1FontUtilTest
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.086 s 
- in org.apache.fontbox.cff.Type1FontUtilTest
[INFO] Running org.apache.fontbox.ttf.BufferedRandomAccessFileTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 s 
- in org.apache.fontbox.ttf.BufferedRandomAccessFileTest
[INFO] Running org.apache.fontbox.ttf.TTFSubsetterTest
Apr 11, 2018 8:31:15 AM org.apache.fontbox.ttf.TTFSubsetter writeToStream
INFO: font subset is empty
Searching for SimHei font...
SimHei font not available on this machine, test skipped
Apr 11, 2018 8:31:15 AM org.apache.fontbox.ttf.TTFSubsetter writeToStream
INFO: font subset is empty
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.304 s 
- in org.apache.fontbox.ttf.TTFSubsetterTest
[INFO] Running org.apache.fontbox.ttf.TestTTFParser
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 

[jira] [Assigned] (PDFBOX-3809) PDAcroForm.flatten(PDField list, refreshAppearances boolean) flattens all form fields instead of specified ones.

2018-04-11 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun reassigned PDFBOX-3809:
--

Assignee: Maruan Sahyoun

> PDAcroForm.flatten(PDField list, refreshAppearances boolean) flattens all 
> form fields instead of specified ones.
> 
>
> Key: PDFBOX-3809
> URL: https://issues.apache.org/jira/browse/PDFBOX-3809
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm
>Affects Versions: 2.0.5, 2.0.6, 2.0.7
>Reporter: Cristin Donaher
>Assignee: Maruan Sahyoun
>Priority: Minor
> Attachments: Example of fields that need to enter and the calculated 
> field from those values.docx, sf270.pdf
>
>
> Thanks for the excellent PDF library.   For my use case I need to flatten a 
> subset of the AcroForm fields.  I was attempting to use the 
> PDAcroForm.flatten call, passing in my field list.  However, after the method 
> is called, all the fields are gone.  
> The method itself appears to remove all PDAnnotationWidget objects from each page 
> and at the end clears the AcroForm's field set.
> Is the javadoc description (This will flatten the specified form fields.) 
> just misleading?   Could a flatten call for a subset of fields be added?
> Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-3809) PDAcroForm.flatten(PDField list, refreshAppearances boolean) flattens all form fields instead of specified ones.

2018-04-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433547#comment-16433547
 ] 

ASF subversion and git services commented on PDFBOX-3809:
-

Commit 1828871 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1828871 ]

PDFBOX-3809: support flatten for specific fields only; current limitation is 
that the widget annotation must have a page reference
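
For anyone following along, a minimal sketch of how the partial flatten is meant 
to be used after this change, assuming the PDAcroForm.flatten(List<PDField>, boolean) 
signature from the issue title (the file and field names are illustrative only):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class FlattenSomeFields
{
    public static void main(String[] args) throws Exception
    {
        PDDocument doc = PDDocument.load(new File("form.pdf"));
        try
        {
            PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();

            // flatten only a subset of the fields; the rest stay editable
            List<PDField> toFlatten = new ArrayList<PDField>();
            toFlatten.add(acroForm.getField("Total")); // illustrative field name
            acroForm.flatten(toFlatten, true);         // true = refresh appearances

            doc.save(new File("form-flattened.pdf"));
        }
        finally
        {
            doc.close();
        }
    }
}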

> PDAcroForm.flatten(PDField list, refreshAppearances boolean) flattens all 
> form fields instead of specified ones.
> 
>
> Key: PDFBOX-3809
> URL: https://issues.apache.org/jira/browse/PDFBOX-3809
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm
>Affects Versions: 2.0.5, 2.0.6, 2.0.7
>Reporter: Cristin Donaher
>Priority: Minor
> Attachments: Example of fields that need to enter and the calculated 
> field from those values.docx, sf270.pdf
>
>
> Thanks for the excellent PDF library.   For my use case I need to flatten a 
> subset of the AcroForm fields.  I was attempting to use the 
> PDAcroForm.flatten call, passing in my field list.  However, after the method 
> is called, all the fields are gone.  
> The method itself appears to remove all PDAnnotationWidget objects from each page 
> and at the end clears the AcroForm's field set.
> Is the javadoc description (This will flatten the specified form fields.) 
> just misleading?   Could a flatten call for a subset of fields be added?
> Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org