[jira] [Resolved] (TIKA-692) TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag

Jukka Zitting (JIRA) Sun, 21 Aug 2011 06:37:50 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jukka Zitting resolved TIKA-692.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

I committed all the patches, thus resolving this as fixed. Michael's style tag 
normalization mechanism should help clean up complicated markup in Word 
documents, and the disabled output indenting in Tika CLI will prevent similar 
whitespace issues from popping up with other document formats or other types of 
markup within Word documents.
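
To illustrate the indenting point, here is a minimal sketch using only the 
standard JAXP API. It is not the committed patch; the class and method names 
(NoIndentSketch, serializeWithoutIndent) are made up for this example. It 
feeds the SAX events for the paragraph from the issue description into a 
serializer with OutputKeys.INDENT set to "no", so the serializer adds no 
whitespace of its own between the adjacent bold runs.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration class; not part of Tika.
public class NoIndentSketch {

    /** Serializes a paragraph with adjacent bold runs from SAX events, indenting disabled. */
    static String serializeWithoutIndent() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // The key setting: the serializer must not insert whitespace by itself.
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        text(handler, "F");
        bold(handler, "oo");
        bold(handler, "b");
        text(handler, "a");
        bold(handler, "r");
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }

    // Emits <b>s</b> as SAX events.
    private static void bold(TransformerHandler h, String s) throws SAXException {
        h.startElement("", "b", "b", new AttributesImpl());
        text(h, s);
        h.endElement("", "b", "b");
    }

    // Emits s as a single characters() event.
    private static void text(TransformerHandler h, String s) throws SAXException {
        h.characters(s.toCharArray(), 0, s.length());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeWithoutIndent());
    }
}
```

With indenting off, any whitespace in the output has to come from the parser's 
own characters() events, which is where we want that decision to be made.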

{quote}
You mean because the other (non-Word) parsers generally embed newlines / 
whitespace in their output'd text, themselves?
{quote}

Yes. The general idea behind that is that even the plain text output should 
reflect the semantic structure of the input document reasonably well. Thus we 
add things like extra lines between paragraphs or tabs between table cells. The 
nice side-effect is that the output usually also becomes quite human-readable. 
Quite a bit of this logic is inside the XHTMLContentHandler class, so it's 
already shared by many parsers, and we could well extend it with things like 
automatic word-wrapping, etc.
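
As a rough sketch of that idea (not the actual XHTMLContentHandler code), the 
following plain SAX handler turns XHTML events into text and adds a newline 
after each paragraph or table row and a tab after each table cell. The class 
name StructuredTextSketch and the exact choice of separators are assumptions 
made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika.
public class StructuredTextSketch extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy character data through unchanged.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("p") || qName.equals("tr")) {
            // End of a paragraph or table row: start a new line.
            text.append('\n');
        } else if (qName.equals("td")) {
            // End of a table cell: separate it from the next cell with a tab.
            text.append('\t');
        }
    }

    /** Parses the given XHTML fragment and returns its structured plain text. */
    static String toText(String xhtml) throws Exception {
        StructuredTextSketch handler = new StructuredTextSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(toText(
                "<body><p>One</p><p>Two</p>"
                + "<table><tr><td>a</td><td>b</td></tr></table></body>"));
    }
}
```

The point is only that the structural whitespace is added deliberately, at 
element boundaries where it carries meaning, rather than by the serializer.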

{quote}
Hmm... I don't think we should be injecting ad-hoc newlines ourselves here?
{quote}

Why not? I don't see any downsides to extra whitespace within <head> as it's by 
definition ignored by automated processing tools.

> TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
> --------------------------------------------------------------------
>
>                 Key: TIKA-692
>                 URL: https://issues.apache.org/jira/browse/TIKA-692
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: 
> 0001-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch, 
> 0002-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch, 
> TIKA-692.patch, TIKA-692.patch, testWORD_bold_character_runs.doc, 
> testWORD_bold_character_runs2.doc
>
>
> [Note: spinoff from the tika-dev thread "Issue in text extraction in
> Solr / Tika" on Aug 19 2011, by nirnaydewan]
> When parsing a Word doc where some contiguous text is bolded, due to
> differences in how the user had bolded different parts of the text
> with Word, TikaCLI -x or -h will sometimes generate output like this:
> {noformat}
> <p>F<b>oob</b>a<b>r</b>
> </p>
> {noformat}
> and other times like this (extra newline & 2 adjacent bold sections):
> {noformat}
> <p>F<b>oo</b>
> <b>b</b>a<b>r</b>
> </p>
> {noformat}
> The extra newline in the second example causes browsers (I tried
> Firefox, Safari, Chrome), JTidy and Tika itself to (incorrectly)
> insert a space when rendering/extracting text, breaking up the word.
> While this might be technically correct/OK (ie, XML white space rules
> might say that non-significant space after the </b> within a <p>
> should be ignored), I think we should still fix Tika to not insert
> newlines, if we can.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
