Hi Tom,

You're right; I don't think one can argue that the default parser should know HTML. How about your suggestion of a separate HTML parser: is it feasible? I ask because I think a lot of people store HTML documents these days, and although there probably aren't many HTML documents with words split across multiple inline elements, it would certainly be nice to have a proper parser for these use cases.
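
To make the idea concrete, here is a rough sketch in Python (not PostgreSQL parser code) of the behavior I have in mind: inline tags are dropped without adding a word break, while every other tag becomes whitespace. The names INLINE_TAGS, InlineAwareText and html_to_text are made up for illustration, and the list of inline elements is an assumption that is certainly incomplete.

```python
from html.parser import HTMLParser

# Assumed, incomplete set of inline elements; a real parser would need the full list.
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "small",
               "span", "strong", "sub", "sup", "u"}


class InlineAwareText(HTMLParser):
    """Collects text, emitting a word break only around non-inline tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _boundary(self, tag):
        # Inline tags vanish silently; any other tag separates tokens.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> get a single boundary.
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return plain text with inline markup removed and whitespace collapsed."""
    parser = InlineAwareText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


print(html_to_text("Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>"))
```

The resulting string could then be passed to to_tsvector as ordinary text, so that "nice" reaches the default parser as a single word while "Hello" and "neighbor" stay separate.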

What do you think?

On Wed, Apr 13, 2016 at 11:09 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Marcelo Zabani <mzab...@gmail.com> writes:
> > I was here wondering whether HTML parsing should separate tokens that are
> > not separated by spaces in the original text, but are separated by an
> > inline element. Let me show you an example:
> >
> > SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>')
> > Results: "'ce':7 'hello':1 'n':5 'neighbor':2"
> >
> > "Hello" and "neighbor" should really be separated, because <p> is a block
> > element, but "nice" should be a single word there, since there is no visual
> > separation when rendered (<em> and <strong> are inline elements).
>
> I can't imagine that we want to_tsvector to know that much about HTML.
> It doesn't, really, even have license to assume that its input *is*
> HTML. So even if you see things that look like <foo> and </foo> in the
> string, it could easily be XML or SGML or some other SGML-like markup
> format with different semantics for the markup keywords.
>
> Perhaps it'd be sane to do something like this as long as the
> HTML-specific behavior was broken out into a separate function.
> (Or maybe it could be done within to_tsvector as a separate parser
> or separate dictionary?) But I don't think it should be part of
> the default behavior.
>
> regards, tom lane