https://issues.apache.org/bugzilla/show_bug.cgi?id=49189
Summary: XWPFWordExtractor discards <w:tab/> entries.
Product: POI
Version: 3.7-dev
Platform: PC
OS/Version: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: XWPF
AssignedTo: [email protected]
ReportedBy: [email protected]
Created an attachment (id=25358)
--> (https://issues.apache.org/bugzilla/attachment.cgi?id=25358)
Test document which exposes the problem
In the current trunk, two characters separated by a tab character are glued
together: the tab is removed.
I tried to debug the issue and found the following piece of code in the
XWPFParagraph.getText() method:
    XmlObject o = c.getObject();
    if (o instanceof CTText) {
        text.append(((CTText) o).getStringValue());
    }
    if (o instanceof CTPTab) {
        text.append("\t");
    }
This seems to assume that wherever a <w:tab/> construct appears in the source
text file, XMLBeans will return an instance of CTPTab. Unfortunately in my case
it seems to return CTEmptyImpl, which is not a CTPTab.
I tried to read the specs, and section 17.3.1.37 says that there is only
one possible parent element for <w:tab>, namely <w:tabs>. In my file,
generated with Office 2010 beta, I have:
<w:p w14:paraId="4EB09767" w14:textId="77777777" w:rsidR="00B3064F"
w:rsidRDefault="00B3064F">
<w:r>
<w:t>a</w:t>
</w:r>
<w:r>
<w:tab />
<w:t>b</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack" />
<w:bookmarkEnd w:id="0" />
</w:p>
You see that <w:tab /> is not enclosed within <w:tabs></w:tabs>.
This might imply that either office produces a wrong file, or the OpenXML XSDs
are wrong, or there is something wrong with XMLBeans class generator, or with
its runtime parser.
Could someone with more knowledge of the OpenXML format take a look at this?
This error spoils fulltext indexing and seems pretty important for the users of
the Aperture Framework.
The easiest workaround for me would be to add a third 'if' for CTEmptyImpl and
put a space in the output. Superfluous whitespace (almost) never hurts, while
glueing words together is bad, but as I said, my knowledge on this topic is
limited.
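To make the expected behaviour concrete, here is a minimal sketch that is
independent of POI and XMLBeans. It uses only the JDK DOM parser and decides by
element name rather than by Java type, so it is not the XWPFParagraph code and
not a proposed patch. The class and method names (TabAwareTextSketch,
paragraphText) are made up for illustration, and the sample paragraph is the one
above with the w14 attributes and bookmarks trimmed. It assumes that a <w:tab/>
directly inside a <w:r> should become "\t", while a <w:tab> inside <w:tabs>
(a tab stop definition) should produce no text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only: not part of POI.
final class TabAwareTextSketch {
    // WordprocessingML main namespace (the "w:" prefix in the sample).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of a WordprocessingML fragment, keeping run-level tabs.
    static String paragraphText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder text = new StringBuilder();
            collect(doc.getDocumentElement(), text);
            return text.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse fragment", e);
        }
    }

    private static boolean isWordElement(Node node, String localName) {
        return node.getNodeType() == Node.ELEMENT_NODE
            && W_NS.equals(node.getNamespaceURI())
            && localName.equals(node.getLocalName());
    }

    private static void collect(Node parent, StringBuilder text) {
        for (Node child = parent.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between elements
            }
            if (isWordElement(child, "t")) {
                text.append(child.getTextContent());
            } else if (isWordElement(child, "tab") && isWordElement(parent, "r")) {
                text.append('\t'); // tab character inside a run
            } else {
                collect(child, text); // descend into runs, properties, etc.
            }
        }
    }

    public static void main(String[] args) {
        String xml =
            "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t>a</w:t></w:r>"
            + "<w:r><w:tab /><w:t>b</w:t></w:r>"
            + "</w:p>";
        // Show the tab visibly; prints a\tb
        System.out.println(paragraphText(xml).replace("\t", "\\t"));
    }
}
```

Checking the element name and its parent sidesteps the question of which
XMLBeans class is returned for <w:tab/>. Whether the same check is convenient
inside XWPFParagraph.getText() is something I cannot judge.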