[ https://issues.apache.org/jira/browse/PDFBOX-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658351#comment-16658351 ]
ASF subversion and git services commented on PDFBOX-4345:
---------------------------------------------------------

Commit 1844512 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1844512 ]

PDFBOX-3646, PDFBOX-4345: remove println

> FDFAnnotation.richContentsToString does not evaluate text nodes which have
> siblings in the XML
> ----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4345
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4345
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.12
>            Reporter: Kai Keggenhoff
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: xfdf
>             Fix For: 2.0.13, 3.0.0 PDFBox
>
>         Attachments: FDFAnnotation_diff.txt, FDFAnnotation_new.java,
> MergeTest.java
>
>
> The method FDFAnnotation.richContentsToString does not evaluate text nodes
> that have siblings in the XML, which can lead to missing text when you parse
> XFDF data and add the annotations to a PDF.
> Example : parsing an XFDF string containing
> <p>Text A <span style="text-decoration:word;">Text B</span> Text C</p>
> and adding the annotation will display only "+Text B+".
> I've included a code sample (MergeTest.java) which generates two PDFs.
> For one PDF, the paragraph contains only spans with text nodes as their only
> children, and all the text is included. For the other PDF, the paragraph has
> mixed text nodes and elements as children, and here the content from the text
> siblings of the "span" is missing.
> I propose the following fix:
> Instead of traversing the children of an element with the XPath "*"
> expression, simply iterate the children obtained from Node.getChildNodes(),
> process Text and CDATASection nodes directly and call richContentsToString
> for any elements.
> (source : FDFAnnotation_new.java, diff to 2.0.12 : FDFAnnotation_diff.txt)
> Furthermore, this method needs to escape "<" and "&" in the text values read
> from the node values, because if these characters are added to the markup
> unescaped, they corrupt the annotations as described in PDFBOX-3646.
> Additionally, I added quoting " as &quot; to the attribute values to avoid
> possible corruption there.
>
> Note : my first attempt at a fix was to replace the XPath "*" expression with
> "node()", but for some reason, when I used this on a test case of
> <p><![CDATA[A]]> B <span>C</span> D</p>
> I would only obtain a NodeList containing the CDATASection, the "span"
> element and the final text node, but not the text node containing "B".
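
For readers without the attachments, the traversal and escaping described in the report can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from FDFAnnotation_new.java or the committed fix: the class RichContentsSketch and the helpers elementToString, escapeText, escapeAttribute and parse are hypothetical names, and the method signature is simplified. It relies only on the standard JAXP/DOM API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; this is not the PDFBox source.
final class RichContentsSketch {

    // Serializes the children of a node, keeping text that sits next to elements.
    static String richContentsToString(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    // Text and CDATA are handled directly and escaped.
                    sb.append(escapeText(child.getNodeValue()));
                    break;
                case Node.ELEMENT_NODE:
                    // Elements are written as markup and traversed recursively.
                    sb.append(elementToString((Element) child));
                    break;
                default:
                    break; // comments etc. are ignored in this sketch
            }
        }
        return sb.toString();
    }

    // Writes the start tag with escaped attribute values, the content, and the end tag.
    private static String elementToString(Element element) {
        StringBuilder sb = new StringBuilder("<").append(element.getTagName());
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            sb.append(' ').append(attribute.getNodeName()).append("=\"")
              .append(escapeAttribute(attribute.getNodeValue())).append('"');
        }
        sb.append('>').append(richContentsToString(element));
        return sb.append("</").append(element.getTagName()).append('>').toString();
    }

    // Escapes "&" first so that the "&" of "&lt;" is not escaped a second time.
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    // Attribute values additionally need the double quote escaped.
    static String escapeAttribute(String s) {
        return escapeText(s).replace("\"", "&quot;");
    }

    // Parses an XML fragment and returns its root element (helper for trying the sketch).
    static Element parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }
}
```

With the paragraph from the issue description, RichContentsSketch.richContentsToString(RichContentsSketch.parse(...)) returns the whole content, Text A <span style="text-decoration:word;">Text B</span> Text C, instead of only the span text. The CDATA test case from the note yields A B <span>C</span> D, so the text node containing "B" is no longer lost.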