[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733776#comment-13733776 ]
Uwe Schindler commented on SOLR-4679:
-------------------------------------

Hoss: I took this issue because it was unassigned and because I was the one who pushed for adding ignorable whitespace in TIKA at the time, so Jukka and I decided this would be best. Since you are still not convinced by my argument, let me recapitulate TIKA's problems:

- TIKA decided to use XHTML as its output format for reporting parsed documents to the consumer. This is nice because it preserves some of the formatting (bold fonts, paragraphs, ...) of the original document. Of course most of this formatting is lost, but you can still "detect" things like emphasized text. Having chosen XHTML as the output format, TIKA must of course use XHTML markup for line breaks and the like. So whenever a line break is needed, the TIKA parser emits a <br/> tag or places the "paragraph" (in a PDF) inside a <p/> element. As we all know, HTML ignores formatting such as newlines, tabs, ... (any run of them is treated as a single whitespace, as in this regex replacement: {{s/\s+/ /}}).
- On the other hand, TIKA wants to make it simple for people to extract the *plain text* contents. With an XHTML-only approach this would be hard for the consumer, because to add the correct newlines the consumer has to fully understand XHTML, detect block elements, and replace them with \n.

To support both uses of TIKA, the idea was to embed this information, which is unimportant to HTML (as HTML ignores whitespace completely), as ignorableWhitespace, as a "convenience" for the user. A fully compliant XHTML consumer would not process the ignorable whitespace. Because it understands HTML, it would detect a <p> element as a block element and format the output accordingly.

Solr unfortunately takes a strange approach. It is mainly interested in the text-only contents, so ideally, when consuming the HTML, it could use {{WriteOutContentHandler(StringBuilder, BodyContentHandler(parserContentHandler))}}.
In that case TIKA would do the right thing automatically: it would extract only the text from the body element and would use the "convenience whitespace" to format the text in an ASCII-art-like way (using tabs, newlines, ...) :-)

Solr has a hybrid approach: it collects everything into a content tag (which is similar to the approach above), but the bug is that, in contrast to TIKA's official WriteOutContentHandler, it does not use the ignorable whitespace inserted for convenience. In addition, TIKA also has a stack that allows parts of the document to be processed separately (like the title element or all <em> elements). In that case it has several StringBuilders in parallel that are populated with the contents. The same problems exist there, but they cannot be solved by using ignorable whitespace: e.g. if one indexes only the <em> elements (which are inline HTML elements, not block elements), there is no whitespace, so all em elements would be glued together in the em field of your index... I just mention this; in my opinion the SolrContentHandler needs more work to "correctly" understand HTML and not just collect element names in a map!

Now to your complaint: you proposed reporting the newlines as real {{characters()}} events, but this is not the right thing to do here. As I said, HTML does not know these characters; they are ignored. The "formatting" is done by the element names (like <p>, <div>, <table>). So the "helper" whitespace for text-only consumers should be inserted as ignorableWhitespace only. If we added it to the real character data, we would report things that no HTML parser (like nekohtml) would ever report to the consumer. Nekohtml would also report this useless extra whitespace as ignorable.

The convenience here is that TIKA's XHTMLContentHandler, used by all parsers, is "configured" to help the text-only user without hurting the HTML-only user. This differentiation is done by reporting the HTML element names (p, div, table, th, td, tr, abbr, em, strong, ...)
but also reporting the ASCII-art-like text-only content, such as tabs inside tables and newlines after block elements. This is always done as ignorableWhitespace (for convenience); a real HTML parser must ignore it, and it is correct to do so.

> HTML line breaks (<br>) are removed during indexing; causes wrong search results
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-4679
>                 URL: https://issues.apache.org/jira/browse/SOLR-4679
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.2
>         Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>            Reporter: Christoph Straßer
>            Assignee: Uwe Schindler
>         Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
> HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during extraction of content from HTML-Files. They need to be replaced with a empty space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <p>
> word1<br>word2<br/>
> Some other words, a special name like linz<br>and another special name - vienna
> </p>
> </html>
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr.
> (wiki.apache.org/solr/ExtractingRequestHandler)
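
As a rough illustration of the characters() versus ignorableWhitespace() distinction discussed in the comment above, here is a minimal sketch that uses only the JDK's SAX classes. The names IgnorableWhitespaceDemo, TextCollector, replay and collect are made up for this example. The event sequence is written by hand, on the assumption that a line break is reported as a br element followed by a newline sent as ignorable whitespace; it is not captured from TIKA or Solr.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class IgnorableWhitespaceDemo {

    /** Hypothetical text-only consumer: collects character data into one buffer. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorableWhitespace;

        TextCollector(boolean keepIgnorableWhitespace) {
            this.keepIgnorableWhitespace = keepIgnorableWhitespace;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Real character data is always collected.
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // The "convenience" whitespace is collected only on request.
            if (keepIgnorableWhitespace) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Replays a hand-written (assumed) event sequence for "linz<br/>and". */
    static void replay(ContentHandler handler) throws SAXException {
        String xhtml = "http://www.w3.org/1999/xhtml";
        handler.startDocument();
        handler.characters("linz".toCharArray(), 0, 4);
        handler.startElement(xhtml, "br", "br", new AttributesImpl());
        handler.endElement(xhtml, "br", "br");
        // Assumption: the line break is also signalled as ignorable whitespace.
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("and".toCharArray(), 0, 3);
        handler.endDocument();
    }

    /** Runs the replay against a fresh collector and returns the collected text. */
    static String collect(boolean keepIgnorableWhitespace) {
        TextCollector collector = new TextCollector(keepIgnorableWhitespace);
        try {
            replay(collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        // A consumer that drops ignorable whitespace glues the words together.
        System.out.println(collect(false));                      // prints "linzand"
        // A consumer that keeps it gets a separator between the words.
        System.out.println(collect(true).replace("\n", "\\n"));  // prints "linz\nand"
    }
}
```

With the ignorable whitespace dropped, the two words run together, which matches the "linzand" symptom in the issue description; with it kept, a tokenizer would see two separate words.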