Hi Bart,
I know very little about ODF, so just some general comments below...
On Sep 25, 2010, at 7:56am, Hanssens Bart wrote:
Hi,
I'm planning to further improve the ODF support in Tika. A few questions
though, that might also be useful for other formats:
Should Tika parse deleted text? XHTML has INS and DEL, but they are to be
used where the content is removed / inserted, while ODF stores removed
content at the very beginning of the document (so "fixing" this will hurt
performance, not sure if that's worth it).
It can also be very confusing for the end user if one gets a result for
"removed"; then again, it is somewhere in the document...
If the above is similar to what you get when tracking changes in, say,
Word, then I would argue for not including the text.
My rule of thumb would be that if the text doesn't appear in "normal"
viewing mode (whatever that means) using a typical app, then it's more
confusing to include it.
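
As a rough illustration of that rule of thumb, here is a minimal sketch of a SAX handler that collects text but skips everything under the tracked-changes block. It is not Tika's actual ODF parser: the class name TrackedChangesSkipper and the helper visibleText are made up for the example, it uses plain JAXP only, and it assumes the removed content lives under a text:tracked-changes element in the ODF text namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only (not part of Tika).
class TrackedChangesSkipper extends DefaultHandler {
    // ODF text namespace; tracked changes are assumed to sit under
    // <text:tracked-changes> near the start of the document body.
    static final String TEXT_NS =
            "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    private final StringBuilder text = new StringBuilder();
    // > 0 while we are inside a tracked-changes subtree.
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        boolean opensSkip = TEXT_NS.equals(uri)
                && "tracked-changes".equals(local);
        if (skipDepth > 0 || opensSkip) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is outside the tracked-changes subtree.
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Hypothetical helper: parse an XML string, return the "visible" text.
    static String visibleText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        TrackedChangesSkipper handler = new TrackedChangesSkipper();
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }
}
```

Since the skipped block is dropped in the same streaming pass, this approach should not need the extra buffering that moving deleted text back into place would.
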
Forms: most form elements in ODF can be mapped to their HTML counterparts,
although I have to check if the result is always valid HTML (i.e., when
ODF parent and form element are mapped to HTML, is the HTML form still
allowed within the mapped parent).
Should they be mapped to HTML forms in the first place? Or just to
div / span?
I wouldn't worry about trying to map explicitly to HTML forms -
capturing the text is 99% of the value here, versus trying to maintain
greater logical consistency between ODF and XHTML.
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g