Peter Davies created TIKA-2640:
----------------------------------

             Summary: MS Word document checkboxes and dropdowns not fully converted to text
                 Key: TIKA-2640
                 URL: https://issues.apache.org/jira/browse/TIKA-2640
             Project: Tika
          Issue Type: Improvement
          Components: core
    Affects Versions: 1.18
         Environment: [^MSWordDocWithCheckboxesAndDropdowns.doc]
            Reporter: Peter Davies
         Attachments: MSWordDocWithCheckboxesAndDropdowns.doc

When we use Tika to extract the text from a Microsoft Word (.doc) file that 
contains a check box, we get +FORMCHECKBOX+ with no indication of whether the 
box is checked.

When the document has a dropdown menu, we get _FORMDROPDOWN_ with no indication 
of which option was selected.

If we parse to XHTML instead, we still get output such as:

 
{code:xml}
<tr> <td><p class="header">Another kind of incident</p>
</td> <td><p class="header"><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" /><a name="__Fieldmark__23_1777734196" />|_|</p>
</td> <td><p />
</td></tr>
{code}
even though the checkbox is ticked in the doc (checkboxes always show *_|_|_*).

Is there a way that Tika can be configured to return text showing what was 
selected in each case?

Our code:

 
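{code:java}
// Load the test document from the classpath.
InputStream stream = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + fileName);
// Extract plain text; a maxLength of -1 disables the output length limit.
String text = new Tika().parseToString(stream, new Metadata(), -1).trim();
{code}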
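
As a stopgap, the extracted text can at least be checked for the placeholders described above, so that documents whose form-field state was lost are flagged. The following is a minimal sketch in plain Java with no Tika dependency. The class name `FormFieldPlaceholderCheck`, the method `countPlaceholders`, and the sample string are hypothetical and only illustrate the idea. It does not recover the checked state or the selected option.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): counts the form-field placeholders
// that show up in text extracted from a .doc file.
class FormFieldPlaceholderCheck {

    // The two placeholder tokens reported in this issue.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("FORMCHECKBOX|FORMDROPDOWN");

    // Returns how many times each placeholder occurs in the extracted text.
    static Map<String, Integer> countPlaceholders(String extractedText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("FORMCHECKBOX", 0);
        counts.put("FORMDROPDOWN", 0);
        Matcher m = PLACEHOLDER.matcher(extractedText);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the result of Tika.parseToString(...).
        String sample = "Another kind of incident FORMCHECKBOX\nSeverity FORMDROPDOWN";
        System.out.println(countPlaceholders(sample));
    }
}
```

A non-zero count for either key means the document contained a form field whose value is not present in the extracted text.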
 

I have attached an example MS Word doc file with checkboxes and a dropdown.

Regards and thanks, Pete

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
