https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

            Bug ID: 61470
           Summary: Text with phonetic runs aren't extracted in docx
           Product: POI
           Version: unspecified
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: ---

Created attachment 35269
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35269&action=edit
example file

Over on TIKA-2448, I found that our DOM model is not extracting runs within
"ruby" sections.  This means that neither the primary text ("東京") nor the
phonetic text ("とうきょう") is extracted.

The more general point is that a run (w:r) can contain another run, here
nested under w:ruby inside w:rt and w:rubyBase...ugh!


  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
           ...
        </w:rPr>
        <w:ruby>
          <w:rt>
            <w:r>
              <w:rPr>
               .....
              </w:rPr>
              <w:t>とうきょう</w:t>
            </w:r>
          </w:rt>
          <w:rubyBase>
            <w:r w:rsidR="001B7DA3">
              <w:rPr>
               ....
              </w:rPr>
              <w:t>東京</w:t>
            </w:r>
          </w:rubyBase>
        </w:ruby>
      </w:r>
    </w:p>
  </w:body>
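
To make the nesting concrete, here is a minimal sketch of how the nested runs
could be walked. It uses only the JDK's DOM parser, not XWPF, so it is not
how POI does or should do it; the class and method names (RubyTextSketch,
extractRuby, collectText, isW) are hypothetical. It also assumes the input
declares the usual WordprocessingML namespace for the "w" prefix, which the
trimmed fragment above omits.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of POI or Tika.
class RubyTextSketch {
    // WordprocessingML main namespace, normally bound to the "w" prefix.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // True if the node is an element named w:<localName>.
    static boolean isW(Node n, String localName) {
        return n.getNodeType() == Node.ELEMENT_NODE
            && W_NS.equals(n.getNamespaceURI())
            && localName.equals(n.getLocalName());
    }

    // Appends the text of every w:t at or below the node,
    // descending through any nested w:r on the way.
    static void collectText(Node n, StringBuilder out) {
        if (isW(n, "t")) {
            out.append(n.getTextContent());
            return;
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    // Returns one {base, phonetic} pair per w:ruby element, in document order.
    static List<String[]> extractRuby(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // required for getElementsByTagNameNS
            Document doc = f.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList rubies = doc.getElementsByTagNameNS(W_NS, "ruby");
            List<String[]> pairs = new ArrayList<>();
            for (int i = 0; i < rubies.getLength(); i++) {
                StringBuilder base = new StringBuilder();
                StringBuilder phonetic = new StringBuilder();
                for (Node c = rubies.item(i).getFirstChild(); c != null;
                     c = c.getNextSibling()) {
                    if (isW(c, "rubyBase")) {
                        collectText(c, base);
                    } else if (isW(c, "rt")) {
                        collectText(c, phonetic);
                    }
                }
                pairs.add(new String[] { base.toString(), phonetic.toString() });
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse WordprocessingML", e);
        }
    }
}
```

Given the fragment above wrapped in a root element that declares the "w"
namespace, extractRuby returns a single pair with base "東京" and phonetic
"とうきょう". How an extractor should present the phonetic text (inline,
parenthesized, or dropped) is a separate decision that this sketch does not
address.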
