[
https://issues.apache.org/jira/browse/PDFBOX-6163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056647#comment-18056647
]
Michael Klink edited comment on PDFBOX-6163 at 2/5/26 1:20 PM:
---------------------------------------------------------------
Have you checked whether the result PDF file is structurally broken? (If you
had shared an example file, that's one of the things I'd have done with it.)
Because if it is structurally broken, one has to look for different causes than
if it is not.
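
A rough first pass at the "structurally broken" question can be sketched in plain Java. This is only a plausibility check under stated assumptions, not a real validation: it looks for the `%PDF-` header at the start of the file and for `startxref` followed by `%%EOF` in the last 1024 bytes. The class name `PdfSanityCheck`, the method name, and the 32-byte minimum size are hypothetical choices made for illustration; a file that passes can still be broken internally, and some viewers tolerate files that fail it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
final class PdfSanityCheck {

    /** Returns true if the file has a PDF header and a startxref/%%EOF tail. */
    static boolean looksStructurallyPlausible(Path pdf) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pdf.toFile(), "r")) {
            long length = raf.length();
            if (length < 32) {
                return false; // arbitrary lower bound; catches 0-byte files
            }

            // The file is expected to start with "%PDF-".
            byte[] head = new byte[5];
            raf.readFully(head);
            if (!"%PDF-".equals(new String(head, StandardCharsets.ISO_8859_1))) {
                return false;
            }

            // A completely written file ends with "startxref <offset> %%EOF".
            int tailSize = (int) Math.min(1024, length);
            byte[] tail = new byte[tailSize];
            raf.seek(length - tailSize);
            raf.readFully(tail);
            String tailText = new String(tail, StandardCharsets.ISO_8859_1);
            int startXref = tailText.lastIndexOf("startxref");
            int eof = tailText.lastIndexOf("%%EOF");
            return startXref >= 0 && eof > startXref;
        }
    }
}
```

Running such a check on every input file right before the merge, and on the merged output right after it, would at least show whether truncated or half-written files are involved.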
That said, there is no good reason why, in general, merging would break only
the text content of one specific font. It's more likely that this is due to
some peculiarity of the documents in question. I'd assume other types of
input documents would show other issues.
As the issue only occurs in the Windchill environment, have you checked whether
there are any restrictions of that environment that may break your process?
E.g. googling around I saw that certain parts of Windchill restrict file sizes
to 4 MB which the merge might exceed.
It is also possible that your incoming files are already broken. Have you
tried copying all your input and output to some storage for debugging?
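
Copying the files for debugging could look roughly like the following. This is a minimal sketch using only the JDK; the class `MergeDebugSnapshot`, its method names, and the report format are hypothetical. It copies each file into a snapshot directory and records its size and SHA-256 checksum, so that the bytes the merger actually saw can be compared later with the original files and with the 4 MB limit mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Windchill.
final class MergeDebugSnapshot {

    /** Copies the files into snapshotDir and returns one report line per file. */
    static List<String> snapshot(List<Path> files, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        List<String> report = new ArrayList<>();
        int index = 0;
        for (Path file : files) {
            // The index prefix keeps same-named files from overwriting each other.
            Path target = snapshotDir.resolve(
                    String.format("%03d_%s", index++, file.getFileName()));
            Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
            report.add(target.getFileName() + "\t" + Files.size(target)
                    + " bytes\tsha256=" + sha256(target));
        }
        return report;
    }

    /** Computes the SHA-256 checksum of a file as a lowercase hex string. */
    static String sha256(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading through the stream updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Taking one snapshot of the inputs immediately before the merge and one of the output immediately after it makes it possible to reproduce a failing merge outside the Windchill environment.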
> Technical Issue Report: Intermittent Data Loss and Font Missing Errors during
> PDF Mergin
> ----------------------------------------------------------------------------------------
>
> Key: PDFBOX-6163
> URL: https://issues.apache.org/jira/browse/PDFBOX-6163
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox, Utilities
> Affects Versions: 2.0.33, 2.0.35
> Environment: Production Environment. Windows Server 2019
> Reporter: Mruthyunjaya S
> Priority: Critical
> Labels: Windchill
> Attachments: After Merge Of the PDFS.png, DataLoss.png,
> Ezk00aGGaL8oi2A3.png, before Merge.png
>
>
> h3. *Technical Issue Report: Intermittent Data Loss and Font Missing Errors
> during PDF Merging*
> *Environment:*
> * *Library:* PDFBox 2.0.33
> * *Operating System:* Windows Server 2019
> * *Application Context:* Windchill MethodServer
> * *Affected Fonts:* Japanese fonts, specifically *MS Gothic* (Subsetted)
> *Issue Summary:* We are experiencing a high failure rate ({*}80% failure{*})
> when merging technical CAD drawings into a single PDF package. Even though
> the source PDFs have fonts embedded, the generated "Merged PDF" frequently
> suffers from missing fonts or garbled text on specific pages.
> *Key Observations:*
> * *Reproducibility:* In a test of 10 consecutive merge attempts, the output
> was correct only {*}2 times{*}, while the remaining *8 attempts* resulted in
> missing fonts.
> * *Specific Error:* The issue is almost exclusively linked to Japanese *MS
> Gothic* variants (e.g., {{ACWDKT+MS Gothic}}).
> * *Error Logs:* We frequently encounter {{Format 14 cmap table is not
> supported}} and {{Format 12 cmap contains an invalid
> glyph index}} warnings during the process.
> * *Client Behavior:* Adobe Acrobat fails to render the text, displaying the
> error: {_}"Cannot extract the embedded font 'ACWDKT+MS Gothic'. Some
> characters may not display or print correctly."{_}.
> *Suspected Causes:*
> # *I/O Race Condition:* We intermittently receive
> {{java.io.IOException: Missing root object specification in trailer}},
> suggesting the merger may be accessing files before they are fully flushed
> to disk or while they are still locked by the external converter.
> # *Resource Clashing:* We suspect the {{PDFMergerUtility}} may be clashing
> font resource aliases (like {{{}/F1{}}}) across different subsetted drawings,
> leading to corrupted Character Maps (CMaps) in the final document.
> *Current Code Implementation:* We have attempted to fix this by implementing
> a {*}Targeted Healing Pass{*}. We re-open the merged PDF, scan for corrupted
> subsets, and re-embed a full English *Century Gothic* font using
> {{PDResources.put()}} and {{{}PDPageContentStream.AppendMode.APPEND{}}}.
> Despite this, the inconsistency persists.
>
>
>
> # Is there a built-in mechanism in {{PDFMergerUtility}} to "flatten" or
> deduplicate subsetted fonts during the merge to prevent CMap clashing?
> # Given that Format 14/12 warnings are logged but don't throw exceptions, is
> there a recommended way to programmatically detect this "data loss" state
> before the file is saved?
> # Are there known issues with {{setupTempFileOnly()}} vs
> {{setupMainMemoryOnly()}} when dealing with large, complex vector drawings
> that might contribute to trailer parsing failures?
>
>
>
> ======================= Code I used for the merge =====================
> private void mergeUsingPDFBox(List<String> pdfFiles, String outputFile)
>         throws IOException {
>     PDFMergerUtility merger = new PDFMergerUtility();
>     merger.setDestinationFileName(outputFile);
>     for (String file : pdfFiles) {
>         merger.addSource(new File(file));
>     }
>     merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
> }
> =====================================================
>
>
>
> ===================== I tried to fix the missing embedded fonts by
> re-embedding a font, but the issue persists. =====================
> private void mergeUsingPDFBox(List<String> pdfFiles, String outputPath)
>         throws IOException {
>     org.apache.pdfbox.multipdf.PDFMergerUtility merger =
>             new org.apache.pdfbox.multipdf.PDFMergerUtility();
>     merger.setDestinationFileName(outputPath);
>     System.out.println("\n[PHASE 1] Initial PDF Merging...");
>     for (String filePath : pdfFiles) {
>         File sourceFile = new File(filePath);
>
>         // LOGIC: Prevent "Missing root object specification in trailer" error.
>         // This happens if we try to merge a file that is still 0 bytes
>         // (locked by converter).
>         if (sourceFile.exists() && sourceFile.length() > 100) {
>             merger.addSource(sourceFile);
>             System.out.println(" --> Added to merge queue: "
>                     + sourceFile.getName() + " [" + sourceFile.length() + " bytes]");
>         } else {
>             System.out.println(" WARN: Skipping empty/invalid file "
>                     + "(might be locked by conversion): " + filePath);
>         }
>     }
>     // Execute merge using Main Memory to protect Windchill server heap
>     merger.mergeDocuments(
>             org.apache.pdfbox.io.MemoryUsageSetting.setupMainMemoryOnly());
>     System.out.println("[PHASE 2] Merge Saved to Disk. Starting CMap "
>             + "Corruption Scan...");
>     // RE-OPEN the result to identify and heal Japanese encoding issues
>     try (PDDocument mergedDoc = PDDocument.load(new File(outputPath))) {
>
>         // LOGIC: Remove security. Modified content streams are blocked
>         // if owner password exists.
>         if (mergedDoc.isEncrypted()) {
>             System.out.println("DEBUG: Removing encryption to allow font "
>                     + "re-embedding...");
>             mergedDoc.setAllSecurityToBeRemoved(true);
>         }
>         // Load replacement font ONCE per document loop for memory efficiency
>         String fontPath = getFontFileFor("GOTHIC");
>         try (InputStream fontStream = new FileInputStream(new File(fontPath))) {
>
>             // Load FULL font (no subsetting) to ensure all English
>             // character indices exist
>             PDType0Font englishFont =
>                     PDType0Font.load(mergedDoc, fontStream, false);
>             int repairCount = 0;
>             for (int i = 0; i < mergedDoc.getNumberOfPages(); i++) {
>                 PDPage page = mergedDoc.getPage(i);
>                 PDResources res = page.getResources();
>                 if (res == null) continue;
>                 boolean pageHasError = false;
>
>                 // STEP A: Detect CMap Corruption (Format 12/14 warnings)
>                 for (COSName fontAlias : res.getFontNames()) {
>                     PDFont font = res.getFont(fontAlias);
>
>                     // Use our helper to force a check of the internal
>                     // font mapping
>                     if (isFontCorrupted(font)) {
>                         System.out.println(" ALERT: Page " + (i + 1)
>                                 + " contains corrupted Japanese CMap/Subsets. Repairing...");
>                         pageHasError = true;
>                         break;
>                     }
>                 }
>                 // STEP B: Heal the problematic page
>                 if (pageHasError) {
>                     for (COSName fontAlias : res.getFontNames()) {
>                         String name =
>                                 res.getFont(fontAlias).getName().toUpperCase();
>
>                         // LOGIC: Target Gothic subsets (+) that failed
>                         // the validation check
>                         if (name.contains("GOTHIC") || name.contains("+")
>                                 || name.contains("MS-")) {
>                             res.put(fontAlias, englishFont);
>                             repairCount++;
>                         }
>                     }
>
>                     // LOGIC: Re-render the operational stream
>                     // Adding a space forces the PDF viewer to reload the
>                     // character map using our new font
>                     try (PDPageContentStream cs = new PDPageContentStream(
>                             mergedDoc, page,
>                             PDPageContentStream.AppendMode.APPEND, true, true)) {
>                         cs.beginText();
>                         cs.setFont(englishFont, 1);
>                         cs.newLineAtOffset(0, 0);
>                         cs.showText(" ");
>                         cs.endText();
>                     }
>                 }
>             }
>             System.out.println("INFO : Total font mappings repaired "
>                     + "during final pass: " + repairCount);
>         }
>         // Final Save: Overwrite the merged file with the high-fidelity
>         // English version
>         mergedDoc.save(outputPath);
>         System.out.println("[PHASE 3] Final healing pass complete. Output "
>                 + "verified.");
>     } catch (Exception e) {
>         System.err.println("CRITICAL ERROR: Failed to heal the merged "
>                 + "PDF: " + e.getMessage());
>         e.printStackTrace();
>     }
>     System.out.println(">>> SUCCESS! High-Fidelity PDF saved to: "
>             + outputPath + "\n");
> }
> =================================================================
> --
> *Logs Captured during Merge:* Our internal diagnostic tools show the
> following warnings from FontBox during the failing merges:
> * {{org.apache.fontbox.ttf.CmapSubtable:
> Format 14 cmap table is not supported and will be
> ignored}}
> * {{org.apache.fontbox.ttf.CmapSubtable:
> Format 12 cmap contains an invalid glyph index}}
> *Questions for the Community:*
> # Why would the merger intermittently corrupt the CMap of a subsetted font
> that is already valid in the source document?
> # Is there a way to force {{PDFMergerUtility}} to *not* rename font subsets
> during merging, as we suspect alias clashing is causing the 80% failure rate?
> # Is there a more reliable way to "flatten" these Japanese fonts during the
> merge process to ensure 100% rendering success?
> *How can we resolve this issue? Please help us.*
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)