Hi Lewis and Sebastian,

First of all, thanks for the reply :) There is no issue about this in our Jira, but I have found a lot of websites whose parsed text still contains HTML tags.
For example: http://www.dersimiz.com/kisa-ilginc-enteresan-tuhaf-acayip-sasirtici-bilgiler.asp#.U2c6H3V_t2M

When this page is parsed by Neko, the parsed text contains HTML tags (http://paste.apache.org/2afD). However, when it is parsed with Gumbo or Jsoup, it is parsed correctly (http://paste.apache.org/7FrE). I also searched Google for some text from the part of the page that was not parsed, using the query "site:http://www.dersimiz.com/kisa-ilginc-enteresan-tuhaf-acayip-sasirtici-bilgiler.asp ÖNCEKİ" (http://tinyurl.com/l33vg5j). I don't see any HTML tags in the snippet. Because of the incorrect HTML tag usage on the page, TagSoup and Neko did not build the DOM tree.

IMHO we can use the existing parse-html plugin and add another backend implementation to it, alongside Neko and TagSoup. Neko and TagSoup are good parsers, but they focus on HTML4, and the Web is changing. An HTML5 parser can also handle HTML4 structure. Gumbo is developed by Google (https://github.com/google/gumbo-parser/). Jsoup and Gumbo implement the WHATWG specification (http://www.whatwg.org/specs/web-apps/current-work/multipage/). IMHO we can build a better parser with these.

I can implement this, but first I would like to know what we expect from the default parser.

Talat

>> Now used parser plugins nekohtml doesnt parse correctly.
>
> What is wrong with it? Are there any issues in Jira to back this up?
>
>> When I tested
>> in huge website site, it leaves html tags.
>
> Pretty vague. Anything else? Any more details? Can this be implemented in
> existing parser plugins?
>
>> IMHO our parser is little
>> bit old.
>
> Which one? Is it possible to upgrade? I don't know which parser you mean.
>
>> After doing some research, I found Jsoup[1] and Gumbo[2]
>> parser. I did some test on broken websites. I saw gumbo and jsoup
>> parsed very similar Google's parser.
>
> So what are the benefits? If we have a clear cut argument then lets go for
> it. If not then maybe your time would be better invested elsewhere.
> It's up to you I suppose :)
>

--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

