Hi Bart,

I know very little about ODF, so just some general comments below...

On Sep 25, 2010, at 7:56am, Hanssens Bart wrote:

> Hi,
>
> I'm planning to further improve the ODF support in Tika. I have a few
> questions, though, which might also be relevant to other formats:
>
> Should Tika parse deleted text? XHTML has INS and DEL, but they are meant
> to be used at the point where the content is removed or inserted, whereas
> ODF stores removed content at the very beginning of the document (so
> "fixing" this will hurt performance, and I'm not sure that's worth it).
> It can also be very confusing for the end user to get a result for
> "removed" text; then again, that text is somewhere in the document...

If the above is similar to what you get when tracking changes in, say, Word, then I would argue for not including the text.

My rule of thumb would be that if the text doesn't appear in "normal" viewing mode (whatever that means) using a typical app, then it's more confusing to include it.
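
To make that concrete, here is a rough sketch of dropping the tracked-changes content during extraction. It is Python rather than Tika's Java, purely for brevity, and it is not how Tika does it. It assumes the removed content sits inside a text:tracked-changes element, which is my reading of the ODF schema; the function name and the sample markup are made up, and the sample is simplified (a real file would also carry change metadata).

```python
import xml.etree.ElementTree as ET

# Namespace URIs as I understand them from the ODF schema.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TRACKED_CHANGES = f"{{{TEXT_NS}}}tracked-changes"


def visible_text(element):
    """Yield text in document order, skipping any tracked-changes subtree.

    Hypothetical helper for illustration only.
    """
    if element.tag == TRACKED_CHANGES:
        return  # nothing below this element shows up in a normal view
    if element.text:
        yield element.text
    for child in element:
        yield from visible_text(child)
        if child.tail:
            yield child.tail


# Simplified, hand-written stand-in for part of a content.xml file.
SAMPLE = (
    f'<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    '<text:tracked-changes>'
    '<text:changed-region text:id="ct1">'
    '<text:deletion><text:p>removed sentence</text:p></text:deletion>'
    '</text:changed-region>'
    '</text:tracked-changes>'
    '<text:p>Kept <text:change text:change-id="ct1"/>sentence.</text:p>'
    '</office:text>'
)

extracted = "".join(visible_text(ET.fromstring(SAMPLE)))
print(extracted)
```

Since the whole subtree is skipped in one place, this costs very little; the expensive part would be re-inserting the deleted text at its original position, which is exactly what I'd avoid.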

> Forms: most form elements in ODF can be mapped to their HTML counterparts,
> although I have to check whether the result is always valid HTML (i.e.,
> when an ODF parent and form element are mapped to HTML, is the HTML form
> still allowed within the mapped parent?).
> Should they be mapped to HTML forms in the first place? Or just to
> div / span?

I wouldn't worry about trying to map explicitly to HTML forms - capturing the text is 99% of the value here, versus trying to maintain greater logical consistency between ODF and XHTML.
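
A minimal sketch of what "just capture the text" could mean for form controls, again in Python for brevity and not Tika code. It assumes the user-visible strings live in the form:label and form:current-value attributes, which is my reading of the ODF schema and certainly not the full list; the function name and sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Form namespace URI as I understand it from the ODF schema.
FORM_NS = "urn:oasis:names:tc:opendocument:xmlns:form:1.0"
FORM_PREFIX = f"{{{FORM_NS}}}"

# Assumption: these two attributes carry most of the visible text.
TEXT_ATTRS = [FORM_PREFIX + "label", FORM_PREFIX + "current-value"]


def form_text(root):
    """Collect label/value strings from form controls, ignoring structure.

    Hypothetical helper for illustration only.
    """
    fragments = []
    for element in root.iter():
        # Only look at elements in the form namespace.
        if not (isinstance(element.tag, str) and element.tag.startswith(FORM_PREFIX)):
            continue
        for attr in TEXT_ATTRS:
            value = element.get(attr)
            if value:
                fragments.append(value)
    return fragments


# Simplified, hand-written form fragment.
SAMPLE = (
    f'<form:form xmlns:form="{FORM_NS}" form:name="Contact">'
    '<form:text form:name="city" form:current-value="Brussels"/>'
    '<form:button form:name="send" form:label="Send"/>'
    '</form:form>'
)

fragments = form_text(ET.fromstring(SAMPLE))
print(fragments)
```

Each fragment could then be emitted as plain text inside a div or span, which sidesteps the question of whether a form is valid inside the mapped parent.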

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




