[ 
https://issues.apache.org/jira/browse/PDFBOX-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658351#comment-16658351
 ] 

ASF subversion and git services commented on PDFBOX-4345:
---------------------------------------------------------

Commit 1844512 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1844512 ]

PDFBOX-3646, PDFBOX-4345: remove println

> FDFAnnotation.richContentsToString does not evaluate text nodes which have 
> siblings in the XML
> ----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4345
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4345
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.12
>            Reporter: Kai Keggenhoff
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: xfdf
>             Fix For: 2.0.13, 3.0.0 PDFBox
>
>         Attachments: FDFAnnotation_diff.txt, FDFAnnotation_new.java, 
> MergeTest.java
>
>
> The method FDFAnnotation.richContentsToString does not evaluate text nodes
> that have siblings in the XML. This can lead to missing text when you parse
> XFDF data and add the annotations to a PDF.
> Example: parsing an XFDF string containing
> <p>Text A <span style="text-decoration:word;">Text B</span> Text C</p>
> and adding the annotation will display only "Text B".
> I've included a code sample (MergeTest.java) which generates two PDFs.
> In one PDF, the paragraph contains only spans with text nodes as their only
> children, and all the text is included. In the other PDF, the paragraph has
> mixed text nodes and elements as children, and the content from the text
> siblings of the "span" is missing.
> I propose the following fix:
> Instead of traversing the children of an element with the XPath "*" 
> expression, simply iterate the children obtained from Node.getChildNodes(), 
> process Text and CDATASection nodes directly and call richContentsToString 
> for any elements.
> (source: FDFAnnotation_new.java, diff to 2.0.12: FDFAnnotation_diff.txt)
> Furthermore, this method needs to escape "<" and "&" in the text values read
> from the node values, because adding these characters unescaped to the markup
> corrupts annotations as described in PDFBOX-3646.
> Additionally, I added quoting of " as &quot; in the attribute values to avoid
> possible corruption there.
>  
> Note: my first attempt at a fix was to replace the XPath "*" expression with
> "node()", but for some reason, when I used this on a test case of
> <p><![CDATA[A]]> B <span>C</span> D</p>
> I would only obtain a NodeList containing the CDATASection, the "span" 
> element and the final text node, but not the text node containing "B".

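The traversal proposed in the description can be sketched as follows. This is
a minimal illustration only, not the code from FDFAnnotation_new.java or the
committed fix: the class RichContentsSketch and its methods parse, escapeText,
escapeAttribute and serialize are hypothetical names, and the escaping is
assumed to cover only the characters named in the report ("<" and "&" in text,
plus the double quote in attribute values).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; not the PDFBox implementation.
final class RichContentsSketch {

    /** Parses an XML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    /** Escapes "&" and "<" so text values cannot break the rebuilt markup. */
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    /** Same as escapeText, and additionally quotes the double quote. */
    static String escapeAttribute(String s) {
        return escapeText(s).replace("\"", "&quot;");
    }

    /** Rebuilds the markup of a node, keeping text siblings of elements. */
    static String serialize(Node node) {
        StringBuilder sb = new StringBuilder();
        append(node, sb);
        return sb.toString();
    }

    private static void append(Node node, StringBuilder sb) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // Text and CDATA sections are handled directly.
            sb.append(escapeText(node.getNodeValue()));
        } else if (type == Node.ELEMENT_NODE) {
            sb.append('<').append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                sb.append(' ').append(attr.getNodeName()).append("=\"")
                  .append(escapeAttribute(attr.getNodeValue())).append('"');
            }
            sb.append('>');
            // Iterate all children, not only elements, and recurse.
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                append(children.item(i), sb);
            }
            sb.append("</").append(node.getNodeName()).append('>');
        }
        // Other node types (comments, processing instructions) are skipped.
    }

    public static void main(String[] args) {
        Element p = parse("<p>Text A <span style=\"text-decoration:word;\">"
                + "Text B</span> Text C</p>");
        System.out.println(serialize(p));
    }
}
```

Run on the example from the description, serialize keeps "Text A" and
"Text C" alongside the span instead of returning only the span content.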

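For context on why the "*" expression loses text: in XPath, "*" on the child
axis selects element nodes only, so text and CDATA siblings are never
returned, while Node.getChildNodes() returns every child. The sketch below
compares the two counts. ChildSelectionSketch and its methods are hypothetical
names used only for this illustration, and a default (non-coalescing)
DocumentBuilderFactory is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; not part of PDFBox.
final class ChildSelectionSketch {

    /** Parses an XML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    /** Number of children selected by the XPath expression "*". */
    static int countXPathStar(Element parent) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("*", parent, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Number of children of any node type, as seen through the DOM. */
    static int countDomChildren(Element parent) {
        return parent.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        Element p = parse("<p>Text A <span>Text B</span> Text C</p>");
        System.out.println("XPath \"*\":       " + countXPathStar(p));
        System.out.println("getChildNodes(): " + countDomChildren(p));
    }
}
```

For the paragraph above, the XPath count is 1 (the span) and the DOM count is
3 (text, span, text), which matches the missing "Text A" and "Text C" in the
report.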

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
