Hi, I'm planning to further improve the ODF support in Tika. I have a few questions, though, which might also be relevant for other formats:

Deleted text: should Tika parse it? XHTML has INS and DEL, but they are meant to be used at the position where the content was inserted or removed, while ODF stores removed content at the very beginning of the document. "Fixing" this would therefore hurt performance, and I'm not sure that's worth it. It could also be very confusing for the end user to get a result for text that was removed; then again, that text is still somewhere in the document...

Forms: most form elements in ODF can be mapped to their HTML counterparts, although I still have to check whether the result is always valid HTML (i.e., when both the ODF parent and the form element are mapped to HTML, is the HTML form element still allowed within the mapped parent?). Should they be mapped to HTML forms in the first place, or just to div / span?

Best regards,
Bart
