[ 
https://issues.apache.org/jira/browse/TIKA-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251034#comment-17251034
 ] 

Tilman Hausherr commented on TIKA-3246:
---------------------------------------

Calling {{getAcroForm()}} with {{new TikaAcroFormFixup(document)}} helps; 
subsequent calls to {{getAcroForm()}} must pass null.
I'll run regression tests to see whether the problems (which the 3 Ghostscript 
files from TIKA-3253 exhibit) are gone.
{code}
    /**
     * Copied from AcroFormDefaultFixup minus generation of appearances,
     * which we don't need.
     */
    class TikaAcroFormFixup extends AbstractFixup
    {
        TikaAcroFormFixup(PDDocument document) {
            super(document);
        }

        @Override
        public void apply() {
            new AcroFormDefaultsProcessor(document).process();

            /*
             * Get the AcroForm in its current state.
             *
             * Also note: getAcroForm() applies a default fixup which this
             * processor is part of. So keep the null parameter, otherwise
             * this will end in an endless recursive call.
             */
            PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm(null);

            if (acroForm != null && acroForm.getFields().isEmpty()) {
                new AcroFormOrphanWidgetsProcessor(document).process();
            }
        }
    }
{code}
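
To illustrate the call pattern described above (pass the fixup on the first call, pass null inside the fixup and on later calls), here is a minimal sketch using stand-in types only. {{Catalog}}, {{Fixup}} and {{OrphanWidgetFixup}} are hypothetical names, not PDFBox classes, and the rule that the stand-in catalog applies every non-null fixup it receives is an assumption of the sketch rather than a statement about PDFBox internals.

```java
import java.util.ArrayList;
import java.util.List;

public class FixupSketch {

    /** Hypothetical stand-in for a fixup; not the PDFBox interface. */
    interface Fixup {
        void apply();
    }

    /** Hypothetical stand-in for the document catalog. */
    static class Catalog {
        final List<String> fields = new ArrayList<>();
        final List<String> orphanWidgets = new ArrayList<>();
        int fixupRuns = 0;

        /** Assumption of this sketch: every non-null fixup passed in is applied. */
        List<String> getAcroForm(Fixup fixup) {
            if (fixup != null) {
                fixupRuns++;
                fixup.apply();
            }
            return fields;
        }
    }

    /** Adopts orphan widgets as fields when the form has no fields. */
    static class OrphanWidgetFixup implements Fixup {
        private final Catalog catalog;

        OrphanWidgetFixup(Catalog catalog) {
            this.catalog = catalog;
        }

        @Override
        public void apply() {
            // Pass null here: handing a fixup to the catalog again would re-enter apply().
            List<String> form = catalog.getAcroForm(null);
            if (form.isEmpty()) {
                form.addAll(catalog.orphanWidgets);
            }
        }
    }

    /** Runs the call sequence and returns how often the fixup was applied. */
    static int demo() {
        Catalog catalog = new Catalog();
        catalog.orphanWidgets.add("widget1");
        catalog.getAcroForm(new OrphanWidgetFixup(catalog)); // first call: with the fixup
        catalog.getAcroForm(null);                           // later calls: with null
        catalog.getAcroForm(null);
        return catalog.fixupRuns;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the fixup runs exactly once, and the null argument inside {{apply()}} is what keeps the stand-in from calling itself again.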

> IllegalArgumentException when generation of appearances fails
> -------------------------------------------------------------
>
>                 Key: TIKA-3246
>                 URL: https://issues.apache.org/jira/browse/TIKA-3246
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.25
>            Reporter: Tilman Hausherr
>            Priority: Major
>
> {noformat}
> java.lang.IllegalArgumentException: No glyph for U+0041 (A) in font BZZZZZ+Aladin-Regular
>       at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.encode(PDCIDFontType2.java:372)
>       at org.apache.pdfbox.pdmodel.font.PDType0Font.encode(PDType0Font.java:422)
>       at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:332)
>       at org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:363)
>       at org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.calculateFontSize(AppearanceGeneratorHelper.java:859)
>       at org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.insertGeneratedAppearance(AppearanceGeneratorHelper.java:494)
>       at org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceContent(AppearanceGeneratorHelper.java:422)
>       at org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceValue(AppearanceGeneratorHelper.java:232)
>       at org.apache.pdfbox.pdmodel.interactive.form.PDTextField.constructAppearances(PDTextField.java:264)
>       at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.refreshAppearances(PDAcroForm.java:327)
>       at org.apache.pdfbox.pdmodel.fixup.processor.AcroFormGenerateAppearancesProcessor.process(AcroFormGenerateAppearancesProcessor.java:54)
>       at org.apache.pdfbox.pdmodel.fixup.AcroFormDefaultFixup.apply(AcroFormDefaultFixup.java:56)
>       at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:132)
>       at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:113)
>       at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:267)
> {noformat}
> This is related to a change in PDFBox in {{PDDocumentCatalog.getAcroForm()}}, 
> where we try to "fix" fields that exist as annotations but not as fields. I 
> wonder if this is needed at all.
> The exception occurs with several files, among them the two AML files of PDFBOX-4086.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
