[ https://issues.apache.org/jira/browse/PDFBOX-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658351#comment-16658351 ]
ASF subversion and git services commented on PDFBOX-4345:
---------------------------------------------------------
Commit 1844512 from [email protected] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1844512 ]
PDFBOX-3646, PDFBOX-4345: remove println
> FDFAnnotation.richContentsToString does not evaluate text nodes which have
> siblings in the XML
> ----------------------------------------------------------------------------------------------
>
> Key: PDFBOX-4345
> URL: https://issues.apache.org/jira/browse/PDFBOX-4345
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 2.0.12
> Reporter: Kai Keggenhoff
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: xfdf
> Fix For: 2.0.13, 3.0.0 PDFBox
>
> Attachments: FDFAnnotation_diff.txt, FDFAnnotation_new.java,
> MergeTest.java
>
>
> The method FDFAnnotation.richContentsToString does not evaluate text nodes
> that have sibling nodes in the XML. This can lead to missing text when you
> parse XFDF data and add the annotations to a PDF.
> Example: parsing an XFDF string containing
> <p>Text A <span style="text-decoration:word;">Text B</span> Text C</p>
> and adding the annotation will display only "Text B".
> I've included a code sample (MergeTest.java) which generates two PDFs.
> In the first PDF, the paragraph contains only spans with text nodes as their
> only children, and all the text is included. In the second PDF, the paragraph
> has mixed text nodes and elements as children, and the content of the text
> nodes that are siblings of the "span" is missing.
> I propose the following fix:
> Instead of traversing the children of an element with the XPath "*"
> expression, simply iterate the children obtained from Node.getChildNodes(),
> process Text and CDATASection nodes directly and call richContentsToString
> for any elements.
> (source : FDFAnnotation_new.java, diff to 2.0.12 : FDFAnnotation_diff.txt)
> Furthermore, this method needs to escape "<" and "&" in the text values read
> from the node values, because if these characters are added to the markup,
> it'll cause corruption of annotations as described in PDFBOX-3646.
> Additionally, I added quoting of " as &quot; in the attribute values to avoid
> possible corruption there.
>
> Note: my first attempt at a fix was to replace the XPath "*" expression with
> "node()", but for some reason, when I used this on a test case of
> <p><![CDATA[A]]> B <span>C</span> D</p>
> I would only obtain a NodeList containing the CDATASection, the "span"
> element and the final text node, but not the text node containing "B".
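
The traversal and escaping proposed in the quoted description can be sketched roughly as below. This is not the code from FDFAnnotation_new.java or the committed PDFBox change. The class RichContentsSketch and the methods escapeText, escapeAttribute, serializeChildren and render are hypothetical names chosen for illustration, and only the JDK DOM API is used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of PDFBox.
class RichContentsSketch {

    // Escape "&" and "<" so text values cannot be read back as markup.
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    // Attribute values additionally need the double quote escaped.
    static String escapeAttribute(String s) {
        return escapeText(s).replace("\"", "&quot;");
    }

    // Walk Node.getChildNodes() instead of the XPath "*" expression, so
    // text and CDATA siblings of elements are visited as well.
    static String serializeChildren(Node parent) {
        StringBuilder sb = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(escapeText(child.getNodeValue()));
            } else if (type == Node.ELEMENT_NODE) {
                sb.append('<').append(child.getNodeName());
                NamedNodeMap attrs = child.getAttributes();
                for (int j = 0; j < attrs.getLength(); j++) {
                    Node attr = attrs.item(j);
                    sb.append(' ').append(attr.getNodeName()).append("=\"")
                      .append(escapeAttribute(attr.getNodeValue())).append('"');
                }
                sb.append('>');
                sb.append(serializeChildren(child)); // recurse into the element
                sb.append("</").append(child.getNodeName()).append('>');
            }
            // Comments and processing instructions are skipped in this sketch.
        }
        return sb.toString();
    }

    // Parse an XML string and serialize it again, root element included.
    static String render(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return serializeChildren(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(
            "<p>Text A <span style=\"text-decoration:word;\">Text B</span> Text C</p>"));
        System.out.println(render("<p><![CDATA[A]]> B <span>C</span> D</p>"));
    }
}
```

With this kind of traversal, the text nodes on either side of the "span" are kept, and the CDATA test case from the note yields all four pieces of content.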