Hello,
 
I am working on an urgent project using Tika and am having problems with POI's 
extraction of content from Word docx files.
 
When POI finds a <w:p> tag in the document.xml file it adds a "\n" to the 
string output, as expected. But when POI finds a <w:br> tag it does nothing, 
so the words on either side of the break are merged together instead of 
appearing on separate lines.
 
I have located the source of the problem in versions 3.6 and 3.7beta3, but I am 
not a greatly experienced developer and could use some help with this.
 
In POI 3.7beta3 the problem can be fixed in the toString method of the 
XWPFRun class.
 
I think this code:
if ("w:cr".equals(tagName)) {
    text.append("\n");
}
 
...should read:
if ("w:br".equals(tagName)) {
    text.append("\n");
}
 
As far as I know the docx format does not contain a <w:cr> tag, so I believe 
the existing check is a mistake. If <w:cr> does turn out to be a valid tag, 
then the existing check should stay and a separate check for <w:br> should be 
added to the method alongside it.
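
To show the behaviour I am after without depending on POI, here is a minimal 
sketch that uses only the JDK's DOM parser. It is not the XWPFRun code: the 
names RunTextSketch and extractText are made up for illustration, and it 
treats both <w:br> and <w:cr> as line breaks in case both tags are valid.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI: walks the direct children of one <w:r> element.
class RunTextSketch {
    static String extractText(String runXml) {
        try {
            // The default factory is not namespace-aware, so getNodeName()
            // returns the qualified name as written, e.g. "w:br".
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(runXml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder text = new StringBuilder();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                String tagName = child.getNodeName();
                if ("w:t".equals(tagName)) {
                    text.append(child.getTextContent());
                } else if ("w:tab".equals(tagName)) {
                    text.append("\t");
                } else if ("w:br".equals(tagName) || "w:cr".equals(tagName)) {
                    // Assumption: both tags should produce a newline in the output.
                    text.append("\n");
                }
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse run XML", e);
        }
    }
}
```

Passing it a run that contains two <w:t> elements separated by a <w:br/> 
should give the two pieces of text on separate lines instead of merged.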
 
In POI 3.6 the problem can be fixed within the XWPFParagraph class.
 
The constructor method builds the output string from the docx tags but does not 
account for the <w:br> tags.
 
After this piece of code:
if (o instanceof CTPTab) {
    text.append("\t");
}
 
another if statement should be added, something along the lines of the 
following (I believe the schema class for <w:br> is CTBr rather than CTPBr):
if (o instanceof CTBr) {
    text.append("\n");
}
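
As a rough, self-contained sketch of that instanceof chain, the following 
uses stand-in classes rather than the real schema-generated types. TextStub, 
TabStub and BreakStub are hypothetical names; BreakStub stands in for 
whichever schema class represents <w:br>.

```java
import java.util.List;

// Hypothetical sketch, not the XWPFParagraph source.
class ParagraphTextSketch {
    // Stand-ins for the schema-generated classes; these are NOT real POI types.
    static class TextStub {
        final String value;
        TextStub(String value) { this.value = value; }
    }
    static class TabStub {}
    static class BreakStub {}

    // Builds the output string from the children of a run, in document order.
    static String buildText(List<Object> runChildren) {
        StringBuilder text = new StringBuilder();
        for (Object o : runChildren) {
            if (o instanceof TextStub) {
                text.append(((TextStub) o).value);
            }
            if (o instanceof TabStub) {
                text.append("\t");
            }
            // The proposed addition: a break becomes a newline.
            if (o instanceof BreakStub) {
                text.append("\n");
            }
        }
        return text.toString();
    }
}
```

Without the last if statement, a break between two pieces of text is skipped 
and the text on either side runs together, which is the symptom I am seeing.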
 
I need this fix quite quickly, so I would be very grateful if somebody could 
help me add it to POI and compile it with Tika. This is my first attempt at 
contributing to an open-source project, so I am not familiar with how this works.
 
Thanks,
 
Glen Thomas                                       
