Hi Lewis and Sebastian,

First of all, thanks for the reply :) There is no issue in our Jira for this.
But I have found a lot of websites whose parsed text contains HTML tags.

For example:
http://www.dersimiz.com/kisa-ilginc-enteresan-tuhaf-acayip-sasirtici-bilgiler.asp#.U2c6H3V_t2M

When it is parsed by Neko, the parsed text contains HTML tags
(http://paste.apache.org/2afD). However, when you parse it with Gumbo or
Jsoup, it is parsed correctly (http://paste.apache.org/7FrE).
Moreover, I searched Google for some text from the unparsed part of this
page with the query
"site:http://www.dersimiz.com/kisa-ilginc-enteresan-tuhaf-acayip-sasirtici-bilgiler.asp
ÖNCEKİ" (http://tinyurl.com/l33vg5j). I don't see any HTML tags in the
snippet. Because of the incorrect HTML tag usage on this page, Tagsoup and
Neko fail to build the DOM tree.
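
To make "parsed text has HTML tags" easier to check across many pages, here is a minimal sketch of a detector for leaked markup. It is only an illustration: the class LeakedTagCheck and its method are hypothetical names, not part of Nutch, and the regular expression is a rough heuristic for tag-like sequences rather than a real HTML tokenizer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): flags parsed text that still contains markup.
final class LeakedTagCheck {
    // Heuristic: matches sequences that look like an opening or closing tag,
    // e.g. <td>, </font>, <a href="...">, <br/>. A bare "<" in prose does not match.
    private static final Pattern TAG =
        Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(\\s[^<>]*)?/?>");

    /** Counts tag-like sequences in text that is supposed to be plain. */
    static int countLeakedTags(String parsedText) {
        Matcher m = TAG.matcher(parsedText);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String clean = "previous page, 1 < 2";
        String leaked = "previous <td class=\"x\">page</td>";
        System.out.println(countLeakedTags(clean));
        System.out.println(countLeakedTags(leaked));
    }
}
```

Running a check like this over the parsed text of a crawl would give a rough count of affected pages per parser, which could back up the comparison with concrete numbers.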

IMHO we can use the existing parse-html plugin and add a different
implementation backend to it, like the Neko and Tagsoup ones. Neko and
Tagsoup are good parsers, but they focus on HTML4, and the Web is changing.
An HTML5 parser can also handle HTML4 structure. Gumbo is developed by Google
(https://github.com/google/gumbo-parser/). Jsoup and Gumbo implement the
WHATWG specification
(http://www.whatwg.org/specs/web-apps/current-work/multipage/). IMHO we
can build a better parser with these. I can implement this, but I want
to know: what is our expectation for the default parser?
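
The backend idea above can be sketched roughly as follows. This is only an assumption about how selection might look: ParserBackends, register and extractText are hypothetical names, not the parse-html plugin's actual API, and the lambdas in main are stand-ins for real Neko or Jsoup calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Nutch.
final class ParserBackends {
    // Backend name -> function that turns raw HTML into plain text.
    private final Map<String, Function<String, String>> backends = new LinkedHashMap<>();
    private final String defaultName;

    ParserBackends(String defaultName) {
        this.defaultName = defaultName;
    }

    void register(String name, Function<String, String> backend) {
        backends.put(name, backend);
    }

    /** Uses the requested backend; falls back to the default when the name is unknown. */
    String extractText(String requestedName, String html) {
        Function<String, String> backend = backends.get(requestedName);
        if (backend == null) {
            backend = backends.get(defaultName);
        }
        if (backend == null) {
            throw new IllegalStateException("no backend registered for default: " + defaultName);
        }
        return backend.apply(html);
    }

    public static void main(String[] args) {
        ParserBackends parsers = new ParserBackends("neko");
        // Stand-ins only: a real backend would call the corresponding parser library.
        parsers.register("neko", html -> "neko:" + html);
        parsers.register("jsoup", html -> "jsoup:" + html);
        System.out.println(parsers.extractText("jsoup", "<p>hi</p>"));
        System.out.println(parsers.extractText("unknown", "<p>hi</p>"));
    }
}
```

With a layout like this, the question of the default parser becomes a single configuration choice, and an HTML5 backend could be tried without removing the existing ones.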

Talat

>> Now used parser plugins nekohtml doesnt parse correctly.
>
>
> What is wrong with it? Are there any issues in Jira to back this up?
>
>>
>> When I tested
>> in huge website site, it leaves html tags.
>
>
> Pretty vague. Anything else? Any more details? Can this be implemented in
> existing parser plugins?
>
>>
>> IMHO our parser is little
>> bit old.
>
>
> Which one? Is it possible to upgrade? I don't know which parser you mean.
>
>>
>> After doing some research, I found Jsoup[1] and Gumbo[2]
>> parser.  I did some test on broken websites. I saw gumbo and jsoup
>> parsed very similar Google's parser.
>>
> So what are the benefits? If we have a clear cut argument then lets go for
> it. If not then maybe your time would be better invested elsewhere. It's up
> to you I suppose :)
>



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
