[jira] [Comment Edited] (PDFBOX-1511) pdfMerger App produces Garbage

Kirk Haines (JIRA) Wed, 31 Jul 2013 12:02:29 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721174#comment-13721174
 ]


Kirk Haines edited comment on PDFBOX-1511 at 7/31/13 7:00 PM:
--------------------------------------------------------------

I have also experienced this (Windows 7, Java 1.6.0_35-b10 64-bit) in PDFBox 
1.7.1 through the current trunk.  I tried Maruan's suggestion and it resolved 
the issue, at the expense of creating unnecessary duplicate resources.  
However, it did not create extra copies of these resources for each page: once 
a Resources object was cloned the first time, it was reused.  Consequently 
there is only one copy of the Resources from each input file, and each page 
references the appropriate Resources object.  My documents did not have 
existing page-level Resources, so I am not sure how Maruan's suggestion would 
work in those cases.  Creating a PageGroup to hold all pages from a given input 
document may be a better option to avoid this issue.
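
The clone-once behavior described above can be sketched roughly as follows.  
This is only an illustration under stated assumptions, not PDFBox code: the 
Resources class, CloneOncePerSourceDemo, cloneOnce and cloneCount are all 
hypothetical stand-ins, and the cache keyed on object identity is an assumed 
mechanism that merely reproduces the observed result (one copy per input file, 
shared by all of its pages).

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical model, NOT PDFBox internals.
class CloneOncePerSourceDemo {

    // Stand-in for a Resources dictionary inherited by every page of one input file.
    static class Resources {
        final String origin;

        Resources(String origin) {
            this.origin = origin;
        }

        Resources copy() {
            return new Resources(origin);
        }
    }

    // Keyed by object identity: the same source object always yields the same clone.
    private final Map<Resources, Resources> clones = new IdentityHashMap<>();

    // Returns the clone for this source object, creating it only on first use.
    Resources cloneOnce(Resources source) {
        return clones.computeIfAbsent(source, Resources::copy);
    }

    // Number of distinct clones created so far.
    int cloneCount() {
        return clones.size();
    }
}
```

In this model, calling cloneOnce for every page of two input files creates two 
clones in total, no matter how many pages each file has.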

I had noticed that the corruption in subsequent documents left the formatting 
of those pages intact, but many letters in the text content were substituted 
(all 'd' replaced by 'f', all 'y' replaced by 'd', etc.).  I also found that 
the degree of corruption depended on how similar the beginning text content of 
each input document was.  When the documents being merged shared a common 
header, there were only a few substitutions.  When a document was merged with 
itself, there were no errors.  When the document headers were very different, 
the resulting text was undecipherable garbage.  This made me suspect a problem 
with a compression dictionary, with the dictionary from the first file being 
used on subsequent files.  At first I thought this dictionary was in the flate 
compression applied to the stream, but found that it was in the CMap of a font 
resource.  Both documents used the same name for the font, so the PDFMerger 
only retained the copy from the first PDF in the merged PDF.  When pages from 
subsequent input documents referenced the font, they used the CMap dictionary 
from the first input document, resulting in various degrees of garbled text.  
Lesson learned: font resources may have content that depends on the strings 
they were used to display.
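
To make the mechanism above concrete, here is a small self-contained sketch.  
It is a deliberately simplified model and not PDFBox code: a "font" is just a 
string of glyphs whose index is the character code, assigned in order of first 
use (an assumption standing in for a subset font and its CMap).  The names 
FontNameCollisionDemo, buildFont, encode, decode and mergeAndRead, and the font 
name "F1", are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model, NOT PDFBox code.
class FontNameCollisionDemo {

    // Builds a subset "font": distinct characters in order of first use.
    // The code of a character is its index, so the table depends on the text.
    static String buildFont(String text) {
        StringBuilder glyphs = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (glyphs.indexOf(String.valueOf(c)) < 0) {
                glyphs.append(c);
            }
        }
        return glyphs.toString();
    }

    // Encodes text as character codes using the given font.
    static int[] encode(String text, String font) {
        int[] codes = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            codes[i] = font.indexOf(text.charAt(i));
        }
        return codes;
    }

    // Decodes character codes using the given font; unknown codes become '?'.
    static String decode(int[] codes, String font) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            out.append(code >= 0 && code < font.length() ? font.charAt(code) : '?');
        }
        return out.toString();
    }

    // Merges two documents that both call their font "F1", keeping only the
    // first font registered under that name, then reads back the second text.
    static String mergeAndRead(String firstText, String secondText) {
        Map<String, String> mergedFonts = new HashMap<>();
        mergedFonts.putIfAbsent("F1", buildFont(firstText));

        String secondFont = buildFont(secondText);
        int[] secondCodes = encode(secondText, secondFont);
        mergedFonts.putIfAbsent("F1", secondFont); // ignored: name already taken

        return decode(secondCodes, mergedFonts.get("F1"));
    }
}
```

In this model, mergeAndRead("daffy", "fady") returns "dafy" ('f' and 'd' are 
swapped), mergeAndRead("header one", "header two") returns "header on?" (only 
the text after the common header is damaged), and merging a text with itself 
returns it unchanged.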
                
      was (Author: kirk.haines):
    I have also experienced this (Windows 7, Java 1.6.0_35-b10 64-bit) in 
PDFBox 1.7.1 thru the current trunk.  I tried Maruan's suggestion and it 
resolved the issue, at the expense of creating unnecessary duplicate resources. 
 I had noticed that the corruption in subsequent documents resulted in those 
pages having their formatting preserved, but the text content had many letters 
substituted (all 'd' replaced by 'f', all 'y' replaced by 'd', etc.)  I also 
found that the degree of corruption depended on how similar the beginning text 
content of each input document was.  When there was a common header in the 
documents being merged, there were only a few substitutions.  When it was 
merging a document with itself, there were no errors.  When the document header 
was very different, the resulting text was undecipherable garbage.  This made 
me suspect that it may be a problem with a compression dictionary, using the 
dictionary from the first file on subsequent files.  At first I thought this 
dictionary was in the flate compression being applied to the stream, but found 
that it was in the CMap of a font resource.  Both documents used the same name 
for the font, so the PDFMerger only retained the copy from the first PDF in the 
merged PDF.  When subsequent pages from subsequent input documents referenced 
the font, they used the CMap dictionary from the first input document, 
resulting in various degrees of garbled text.  Lesson learned, Font resources 
may have content that is dependent on the strings they were used to display.
                  
> pdfMerger App produces Garbage
> ------------------------------
>
>                 Key: PDFBOX-1511
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1511
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.7.1
>         Environment: Win XP; Windows Server 2008 R2; java version "1.6.0_21", 
>            Reporter: Michael Huber
>         Attachments: 1.pdf, 2.pdf, PdfRenderer.java, targetPdfMergeJava.pdf, 
> targetPdfMergeUtilityApp.pdf
>
>
> pdfbox Utility pdfMerger produces a merged document containing garbage. All 
> merged pdf files are contained but Strings are destroyed.
> The source pdf files are created with graphviz and are readable without error 
> or disturbance both with Acrobat X and pdfbox pdfDebug Utility.
> Another astounding thing is that a handcoded merger using the 
> pdfMergerUtility class works fine when run within Eclipse Juno and creates 
> the same garbage when run from cmd line (pls. see attached source)
> I checked everything that comes to mind to find the differences, e.g. Java 
> version, encoding/codepage issues, memory settings, and found nothing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
