Hello,
I am working on an urgent project using Tika and am having problems with POI's
extraction of content from Word docx files.
When POI finds a <w:p> tag in the document.xml file it adds a "\n" to the
string output, as expected. But when POI finds a <w:br> tag it does nothing,
which causes words in the text to run together rather than appear on
separate lines.
I have located the source of the problem in versions 3.6 and 3.7beta3, but I am
not a very experienced developer and could use some help with this.
In POI 3.7beta3 the problem can be fixed in the toString method of the
XWPFRun class.
I think this code:
if ("w:cr".equals(tagName)) {
    text.append("\n");
}
...should read:
if ("w:br".equals(tagName)) {
    text.append("\n");
}
As far as I know the docx format does not contain a <w:cr> tag, so I believe
this is an error. If there is a <w:cr> tag, then a check for <w:br> should be
added to the method alongside the existing one.
In POI 3.6 the problem can be fixed in the XWPFParagraph class.
The constructor builds the output string from the docx tags but does not
account for <w:br> tags.
After this piece of code:
if (o instanceof CTPTab) {
    text.append("\t");
}
another if statement should be added to say something along the lines of:
if (o instanceof CTBr) {
    text.append("\n");
}
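
For illustration, here is a minimal standalone sketch of the behaviour being
asked for. It is not POI's code: the class RunTextSketch and the method runText
are hypothetical names, and it uses only the JDK's DOM parser to walk the
children of a single <w:r> element by tag name. It treats both <w:br> and
<w:cr> as line breaks, on the assumption that either may turn up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; this is not the XWPFRun implementation.
public class RunTextSketch {

    // Builds the plain text of one <w:r> element given as an XML string.
    static String runText(String runXml) {
        try {
            // The default factory is not namespace-aware, so getNodeName()
            // returns the prefixed name as written, e.g. "w:br".
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(runXml.getBytes(StandardCharsets.UTF_8)));

            StringBuilder text = new StringBuilder();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                String tagName = child.getNodeName();
                if ("w:t".equals(tagName)) {
                    text.append(child.getTextContent());
                } else if ("w:tab".equals(tagName)) {
                    text.append("\t");
                } else if ("w:br".equals(tagName) || "w:cr".equals(tagName)) {
                    // The case that is currently dropped for <w:br>.
                    text.append("\n");
                }
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse run XML", e);
        }
    }

    public static void main(String[] args) {
        String ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        String run = "<w:r xmlns:w=\"" + ns + "\">"
                + "<w:t>first</w:t><w:br/><w:t>second</w:t></w:r>";
        // With the <w:br> check in place the two words land on separate lines.
        System.out.println(runText(run));
    }
}
```

Without the <w:br> branch the same input would come out as "firstsecond",
which is the merged-words symptom described above.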
I need this fix quite quickly, so I would be very grateful if somebody could
help me add it to POI and build it with Tika. This is my first attempt at
contributing to an open-source project, so I am not familiar with how the
process works.
Thanks,
Glen Thomas