All,

I can't tell whether the triggering file is corrupt, or how we want to handle it on the PDFBox side. The problem is that the parent node is a PDTextField (a PDTerminalField), so we don't/can't look for children, even though it does have pointers in Kids.
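To make the expected behavior concrete, here is a minimal sketch of a walk that descends into Kids no matter how the node itself was typed. It deliberately does not use the PDFBox API: FieldNode and FieldTreeSketch are hypothetical stand-ins for a field dictionary (/T, /V, /Kids), and it assumes every kid is a named field rather than a bare widget annotation, which is not guaranteed in real files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a field dictionary: /T (partial name), /V (value), /Kids.
class FieldNode {
    final String partialName;
    final String value;
    final List<FieldNode> kids;

    FieldNode(String partialName, String value, FieldNode... kids) {
        this.partialName = partialName;
        this.value = value;
        this.kids = Arrays.asList(kids);
    }
}

class FieldTreeSketch {
    // Returns "fully.qualified.name: value" for every node that has no kids.
    // A node with kids is always descended into, whatever its field type is.
    static List<String> collect(FieldNode node, String parentName) {
        String name = parentName.isEmpty()
                ? node.partialName
                : parentName + "." + node.partialName;
        List<String> out = new ArrayList<>();
        if (node.kids.isEmpty()) {
            out.add(name + ": " + (node.value == null ? "" : node.value));
        } else {
            for (FieldNode kid : node.kids) {
                out.addAll(collect(kid, name));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the structure described for simple-form.pdf.
        FieldNode parent = new FieldNode("parent", null,
                new FieldNode("child1", "child1 value"),
                new FieldNode("child2", "child2 value"));
        for (String line : collect(parent, "")) {
            System.out.println(line);
        }
    }
}
```

Whether PDFBox should treat a field that carries both a field type and named Kids this way is exactly the open question; the sketch only shows what the recursive output would look like.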
The output from PrintFields is:

1 top-level fields were found on the form
|--parent.parent = , type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField

-----Original Message-----
From: Tim Allison (JIRA) [mailto:[email protected]]
Sent: Monday, August 14, 2017 10:36 AM
To: [email protected]
Subject: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

    [ https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125756#comment-16125756 ]

> Non-terminal interactive form fields not handled recursively
> ------------------------------------------------------------
>
>                 Key: TIKA-2442
>                 URL: https://issues.apache.org/jira/browse/TIKA-2442
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>   Affects Versions: 1.14
>            Reporter: Christopher Creutzig
>         Attachments: simple-form.pdf
>
>
> (I am not sure if this is a Tika or a PDFBox problem; I tried finding
> a form extractor in PDFBox, but the app api does not have one. PDFDebugger
> does show me the expected tree structure.)
> The attached PDF has a non-terminal field named “parent” and two children,
> “child1” and “child2.” According to the PDF spec in section 8.6, the fully
> qualified field names should be parent.child1 and parent.child2. That is
> the output given by pdftk:
>
> pdftk simple-form.pdf dump_data_fields
> ---
> FieldType: Text
> FieldName: parent.child1
> FieldFlags: 0
> FieldValue: child1 value
> FieldJustification: Left
> ---
> FieldType: Text
> FieldName: parent.child2
> FieldFlags: 0
> FieldValue: child2 value
> FieldJustification: Left
>
> Tika with the ToXMLContentHandler seems to silently ignore the children,
> however, returning only a parent with no value.
> Calling code:
> import java.io.FileInputStream;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.PasswordProvider;
> import org.apache.tika.sax.ToXMLContentHandler;
> class readAsXHTML {
>   public static String readAsXHTML(String filename) throws Exception {
>     ToXMLContentHandler handler = new ToXMLContentHandler();
>     Detector detector = new DefaultDetector();
>     Parser parser = new AutoDetectParser(detector);
>     ParseContext context = new ParseContext();
>     Metadata metadata = new Metadata();
>     FileInputStream fh = null;
>     try {
>       fh = new FileInputStream(filename);
>       parser.parse(fh, handler, metadata, context);
>
>       return(handler.toString());
>     }
>     finally {
>       if (fh != null) {
>         fh.close();
>       }
>     }
>   }
> }
>
> Abbreviated output:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>
>   <li>parent: </li>
> </ol>
> </div>
> </body>
>
> Expected:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>
>   <li>parent.child1: child1 value</li>
>   <li>parent.child2: child2 value</li>
> </ol>
> </div>
> </body>
