[jira] [Commented] (TIKA-692) TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag

Michael McCandless (JIRA) Sun, 21 Aug 2011 06:21:54 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088360#comment-13088360
 ]


Michael McCandless commented on TIKA-692:
-----------------------------------------

bq. The <body> content emitted by many parsers is already designed to be 
reasonably readable, so the only thing that gets badly mixed without automatic 
indenting is the <head> section and all the metadata contained in it.

You mean because the other (non-Word) parsers generally embed
newlines / whitespace in their output text themselves?  EG, when I
tested a random PDF I have, I can see added newlines and the body text
is very readable as is.

But the Word parsers don't seem to do this...

And I don't think we should rely on/expect/require the parsers to
pretty-print their body text output?  Ie, the PDFParser's output is
readable because it does very little actual markup -- big chunks of
pre-formatted (with inserted whitespace) text between <p>..</p> tags,
while the Word parsers seem to do the opposite (relatively large
amounts of markup).

bq. The attached 0002 patch modifies the XHTMLContentHandler class to add extra 
whitespace within the <head> section to make the output more readable.

Hmm... I don't think we should be injecting ad-hoc newlines ourselves
here?  In general I think Tika should do as little whitespace
manipulation as possible (I'm still not sure why/where the RTFParser
is adding whitespace)?

I think it's important that TikaCLI's output be human readable -- this
is a very useful ad-hoc debugging tool.

I hope we can find some way to use the serializer on TIKA-651.


> TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
> --------------------------------------------------------------------
>
>                 Key: TIKA-692
>                 URL: https://issues.apache.org/jira/browse/TIKA-692
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: 
> 0001-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch, 
> 0002-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch, 
> TIKA-692.patch, TIKA-692.patch, testWORD_bold_character_runs.doc, 
> testWORD_bold_character_runs2.doc
>
>
> [Note: spinoff from the tika-dev thread "Issue in text extraction in
> Solr / Tika" on Aug 19 2011, by nirnaydewan]
> When parsing a Word doc where some contiguous text is bolded, due to
> differences in how the user had bolded different parts of the text
> with Word, TikaCLI -x or -h will sometimes generate output like this:
> {noformat}
> <p>F<b>oob</b>a<b>r</b>
> </p>
> {noformat}
> and other times like this (extra newline & 2 adjacent bold sections):
> {noformat}
> <p>F<b>oo</b>
> <b>b</b>a<b>r</b>
> </p>
> {noformat}
> The extra newline in the second example causes browsers (I tried
> Firefox, Safari, Chrome), JTidy and Tika itself to (incorrectly)
> insert a space when rendering/extracting text, breaking up the word.
> While this might be technically correct/OK (ie, XML white space rules
> might say that non-significant space after the </b> within a <p>
> should be ignored), I think we should still fix Tika to not insert
> newlines, if we can.
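
To make the effect described above concrete, here is a minimal sketch (not
Tika's code) that runs the two fragments from the description through a
generic XML parser and then collapses whitespace runs into single spaces.
The helper name extract_text and the collapsing rule are assumptions chosen
to mimic roughly how a browser or text extractor treats inter-element
whitespace; the actual consumers named in the issue may differ in detail.

```python
import xml.etree.ElementTree as ET


def extract_text(xhtml_fragment):
    """Concatenate character data in document order, then collapse whitespace.

    Hypothetical helper for illustration only; it is not part of Tika.
    """
    root = ET.fromstring(xhtml_fragment)
    raw = "".join(root.itertext())  # includes the newline that follows </b>
    return " ".join(raw.split())    # HTML-style collapsing of whitespace runs


# The two outputs quoted in the issue description.
without_newline = "<p>F<b>oob</b>a<b>r</b>\n</p>"
with_newline = "<p>F<b>oo</b>\n<b>b</b>a<b>r</b>\n</p>"

print(extract_text(without_newline))  # Foobar
print(extract_text(with_newline))     # Foo bar
```

The newline after the first </b> is ordinary character data inside the <p>,
so once whitespace is collapsed it survives as a single space in the middle
of the word.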

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
