https://bz.apache.org/bugzilla/show_bug.cgi?id=61470
Bug ID: 61470
Summary: Text with phonetic runs isn't extracted from docx
Product: POI
Version: unspecified
Hardware: PC
Status: NEW
Severity: normal
Priority: P2
Component: XWPF
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: ---
Created attachment 35269
--> https://bz.apache.org/bugzilla/attachment.cgi?id=35269&action=edit
example file
Over on TIKA-2448, I found that our DOM model does not extract runs nested
inside "ruby" (<w:ruby>) sections. As a result, neither the primary text
("東京") nor the phonetic text ("とうきょう") is extracted.
The more general point is that a run (<w:r>) can itself contain runs, as in
the markup below:
<w:body>
  <w:p>
    <w:r>
      <w:rPr>
        ...
      </w:rPr>
      <w:ruby>
        <w:rt>
          <w:r>
            <w:rPr>
              .....
            </w:rPr>
            <w:t>とうきょう</w:t>
          </w:r>
        </w:rt>
        <w:rubyBase>
          <w:r w:rsidR="001B7DA3">
            <w:rPr>
              ....
            </w:rPr>
            <w:t>東京</w:t>
          </w:r>
        </w:rubyBase>
      </w:ruby>
    </w:r>
  </w:p>
</w:body>
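To make the nesting concrete, here is a minimal sketch of the traversal an extractor would need. It is not POI code and does not reflect how XWPF is implemented; it uses only Python's standard xml.etree.ElementTree against a trimmed copy of the markup above (the <w:rPr> elements are dropped). The names SAMPLE, collect, and pairs are hypothetical, chosen for illustration. The idea is simply that a run handler has to recurse into <w:ruby> instead of looking only at direct <w:t> children.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in docx parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
T_TAG = f"{{{W}}}t"
RUBY_TAG = f"{{{W}}}ruby"

# Trimmed copy of the markup from the report (assumption: w:rPr omitted).
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r>
      <w:ruby>
        <w:rt><w:r><w:t>とうきょう</w:t></w:r></w:rt>
        <w:rubyBase><w:r><w:t>東京</w:t></w:r></w:rubyBase>
      </w:ruby>
    </w:r>
  </w:p>
</w:body>"""


def collect(run, kind="base", out=None):
    """Gather (kind, text) pairs from a w:r, descending into nested runs."""
    out = [] if out is None else out
    for child in run:
        if child.tag == T_TAG:
            out.append((kind, child.text or ""))
        elif child.tag == RUBY_TAG:
            # A ruby element holds further runs: base text first, then phonetic.
            for inner in child.findall("w:rubyBase/w:r", NS):
                collect(inner, "base", out)
            for inner in child.findall("w:rt/w:r", NS):
                collect(inner, "phonetic", out)
    return out


body = ET.fromstring(SAMPLE)
pairs = [p for run in body.findall("w:p/w:r", NS) for p in collect(run)]
for kind, text in pairs:
    print(kind, text)
```

A handler that only reads the direct <w:t> children of the outer run would return nothing for this paragraph, which matches the behavior described above.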