[ https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393296#comment-16393296 ]
Tilman Hausherr edited comment on TIKA-2442 at 3/9/18 6:04 PM:
---------------------------------------------------------------
Isn't this issue solved? (I stumbled upon it while searching for something else)
> Non-terminal interactive form fields not handled recursively
> ------------------------------------------------------------
>
> Key: TIKA-2442
> URL: https://issues.apache.org/jira/browse/TIKA-2442
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Reporter: Christopher Creutzig
> Priority: Major
> Attachments: simple-form.pdf
>
>
> (I am not sure whether this is a Tika or a PDFBox problem; I tried to find a
> form extractor in PDFBox, but the app API does not have one. PDFDebugger does
> show me the expected tree structure.)
> The attached PDF has a non-terminal field named “parent” and two children,
> “child1” and “child2.” According to the PDF spec in section 8.6, the fully
> qualified field names should be parent.child1 and parent.child2. That is the
> output given by pdftk:
> > pdftk simple-form.pdf dump_data_fields
> ---
> FieldType: Text
> FieldName: parent.child1
> FieldFlags: 0
> FieldValue: child1 value
> FieldJustification: Left
> ---
> FieldType: Text
> FieldName: parent.child2
> FieldFlags: 0
> FieldValue: child2 value
> FieldJustification: Left
> Tika with the ToXMLContentHandler, however, seems to silently ignore the
> children and returns only the parent, with no value.
> Calling code:
> import java.io.FileInputStream;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.ToXMLContentHandler;
> class readAsXHTML {
>     // Parses the given file with Tika and returns the XHTML output as a string.
>     public static String readAsXHTML(String filename) throws Exception {
>         ToXMLContentHandler handler = new ToXMLContentHandler();
>         Detector detector = new DefaultDetector();
>         Parser parser = new AutoDetectParser(detector);
>         ParseContext context = new ParseContext();
>         Metadata metadata = new Metadata();
>         // try-with-resources closes the stream even if parsing throws
>         try (FileInputStream fh = new FileInputStream(filename)) {
>             parser.parse(fh, handler, metadata, context);
>             return handler.toString();
>         }
>     }
> }
> Abbreviated output:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol> <li>parent: </li>
> </ol>
> </div>
> </body>
> Expected:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>
> <li>parent.child1: child1 value</li>
> <li>parent.child2: child2 value</li>
> </ol>
> </div>
> </body>
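For illustration only, the recursive handling the report asks for can be sketched in plain Java. This is not Tika's or PDFBox's code: the `Field` class, its `partialName`, `value` and `children` members, and the `render` method are all hypothetical stand-ins. The sketch assumes that a fully qualified name is built by joining the partial names of a field and its ancestors with a period, as the parent.child1 example in the report suggests, and that only terminal fields are listed.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldTreeSketch {

    /** Hypothetical stand-in for a form field; not a Tika or PDFBox type. */
    static final class Field {
        final String partialName;
        final String value; // null when the field carries no value
        final List<Field> children = new ArrayList<>();

        Field(String partialName, String value) {
            this.partialName = partialName;
            this.value = value;
        }

        /** Adds a child and returns this field so calls can be chained. */
        Field add(Field child) {
            children.add(child);
            return this;
        }
    }

    /** Returns one "qualified.name: value" line per terminal field under root. */
    static List<String> render(Field root) {
        List<String> out = new ArrayList<>();
        collect(root, "", out);
        return out;
    }

    /** Depth-first walk that carries the ancestors' names down as a prefix. */
    private static void collect(Field field, String prefix, List<String> out) {
        String fullName = prefix.isEmpty()
                ? field.partialName
                : prefix + "." + field.partialName;
        if (field.children.isEmpty()) {
            // Terminal field: emit it with its value (empty if none).
            out.add(fullName + ": " + (field.value == null ? "" : field.value));
            return;
        }
        // Non-terminal field: recurse instead of stopping at the parent.
        for (Field child : field.children) {
            collect(child, fullName, out);
        }
    }

    public static void main(String[] args) {
        // Mirrors the structure described for simple-form.pdf.
        Field parent = new Field("parent", null)
                .add(new Field("child1", "child1 value"))
                .add(new Field("child2", "child2 value"));
        for (String line : render(parent)) {
            System.out.println(line);
        }
    }
}
```

Run against the structure described for simple-form.pdf, the sketch lists parent.child1 and parent.child2 with their values, matching the list items in the expected output above.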