[jira] [Commented] (TIKA-692) TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag

Michael McCandless (JIRA) Sun, 21 Aug 2011 07:21:50 -0700

    [ https://issues.apache.org/jira/browse/TIKA-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088372#comment-13088372 ]


Michael McCandless commented on TIKA-692:
-----------------------------------------


{quote}
bq. You mean because the other (non-Word) parsers generally embed newlines / 
whitespace in their output'd text, themselves?

Yes. The general idea behind that is that even the plain text output should 
reasonably well reflect the semantic structure of the input document. Thus we 
add things like extra lines between paragraphs or tabs between table cells. The 
nice side-effect is that the output usually also becomes quite human-readable. 
Quite a bit of this logic is inside the XHTMLContentHandler class, so it's 
already shared by many parsers, and we could well extend it with things like 
automatic word-wrapping, etc.
{quote}

OK, I think that makes sense... thanks for the explanation!  I do love
the readability of the plain text output when I filter, e.g., a PDF...

In fact, in the original user thread (that led to opening this), I
believe the user was bothered because the filtered text for the Word
doc did not reflect the wrapping that Word does when it renders.

So maybe we can somehow figure out how to have the Word parser(s) do
this... though this is likely a (complex) render-time only thing for
Word.

{quote}
bq. Hmm... I don't think we should be injecting ad-hoc newlines ourselves here?

Why not? I don't see any downsides to extra whitespace within <head> as it's by 
definition ignored by automated processing tools.
{quote}

Well, generally it makes me nervous if Tika introduces changes to the
original content.  I.e., I think Tika should strive to be as
"pass-through" as is reasonably possible... just pure plumbing.

Still, I agree that in the head section whitespace will never be
considered part of the content, so it should be OK here.  Plus we are
basically creating that head section ourselves, from the extracted
metadata.

Separately, I'd still like to add a -prettyPrint option to TikaCLI (to
optionally turn INDENT back on, only applying when -x or -h is used,
defaulting to off): I think this is very useful for ad-hoc debugging,
where a user/dev wants to quickly see the XML/XHTML structure that
Tika derived from the content.  I'm happy to cons up the patch for
this...
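
To make the idea concrete, here is a rough sketch of what toggling INDENT
looks like with the plain JDK javax.xml.transform API.  This is not TikaCLI's
actual code: the class name, the serialize() method and the prettyPrint
boolean are made up for illustration, and the boolean merely stands in for
the proposed -prettyPrint flag.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: a SAX serializer whose INDENT setting is driven by a flag.
class PrettyPrintSketch {

    static String serialize(boolean prettyPrint) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // The only line that depends on the (assumed) -prettyPrint switch.
        handler.getTransformer().setOutputProperty(
                OutputKeys.INDENT, prettyPrint ? "yes" : "no");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Emit a tiny document: a head section plus the <p> from this issue.
        handler.startDocument();
        open(handler, "html");
        open(handler, "head");
        open(handler, "title"); text(handler, "t"); close(handler, "title");
        close(handler, "head");
        open(handler, "body");
        open(handler, "p");
        text(handler, "F");
        open(handler, "b"); text(handler, "oo"); close(handler, "b");
        open(handler, "b"); text(handler, "b"); close(handler, "b");
        text(handler, "a");
        open(handler, "b"); text(handler, "r"); close(handler, "b");
        close(handler, "p");
        close(handler, "body");
        close(handler, "html");
        handler.endDocument();
        return out.toString();
    }

    private static void open(TransformerHandler h, String name) throws Exception {
        h.startElement("", name, name, new AttributesImpl());
    }

    private static void close(TransformerHandler h, String name) throws Exception {
        h.endElement("", name, name);
    }

    private static void text(TransformerHandler h, String s) throws Exception {
        h.characters(s.toCharArray(), 0, s.length());
    }

    public static void main(String[] args) throws Exception {
        System.out.println("INDENT=no:\n" + serialize(false));
        System.out.println("INDENT=yes:\n" + serialize(true));
    }
}
```

With INDENT off, the serializer writes only what the SAX events contain, so
the two adjacent bold runs stay glued together.  With INDENT on, where
exactly the line breaks land depends on the serializer shipped with the JDK,
which is why it seems better as an opt-in debugging aid than as the default.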


> TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
> --------------------------------------------------------------------
>
>                 Key: TIKA-692
>                 URL: https://issues.apache.org/jira/browse/TIKA-692
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: 
> 0001-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch, 
> 0002-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch, 
> TIKA-692.patch, TIKA-692.patch, testWORD_bold_character_runs.doc, 
> testWORD_bold_character_runs2.doc
>
>
> [Note: spinoff from the tika-dev thread "Issue in text extraction in
> Solr / Tika" on Aug 19 2011, by nirnaydewan]
> When parsing a Word doc where some contiguous text is bolded, due to
> differences in how the user had bolded different parts of the text
> with Word, TikaCLI -x or -h will sometimes generate output like this:
> {noformat}
> <p>F<b>oob</b>a<b>r</b>
> </p>
> {noformat}
> and other times like this (extra newline & 2 adjacent bold sections):
> {noformat}
> <p>F<b>oo</b>
> <b>b</b>a<b>r</b>
> </p>
> {noformat}
> The extra newline in the second example causes browsers (I tried
> Firefox, Safari, Chrome), JTidy and Tika itself to (incorrectly)
> insert a space when rendering/extracting text, breaking up the word.
> While this might be technically correct/OK (i.e., XML white space rules
> might allow non-significant space after the </b> within a <p> to be
> ignored), I think we should still fix Tika to not insert
> newlines, if we can.
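
The effect described above can be reproduced with a small sketch that uses
only the JDK DOM parser.  It assumes a simplified model of what a consumer
does (concatenate the text nodes, then collapse runs of whitespace into a
single space); real browsers, JTidy and Tika each have their own rules, and
the class and method names here are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: how a newline between two <b> runs turns into a space.
class NewlineBetweenRunsSketch {

    static String visibleText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        // Concatenate all descendant text nodes, including whitespace-only ones.
        String raw = doc.getDocumentElement().getTextContent();
        // Simplified whitespace handling: collapse runs to one space and trim.
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(visibleText("<p>F<b>oob</b>a<b>r</b>\n</p>"));
        System.out.println(visibleText("<p>F<b>oo</b>\n<b>b</b>a<b>r</b>\n</p>"));
    }
}
```

Under this model the first snippet yields the intact word, while the second
one comes out split in two, because the newline after the first </b> survives
as a whitespace text node inside the <p>.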

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
