Hi Talat,

Thanks for the examples. I've also observed that Neko has some problems
even with valid HTML5. Luckily, most pages do not make excessive use of the
syntactic "freedom" HTML5 allows (not closing tags, leaving out "implicit"
tags). Some problems can be easily fixed (e.g., NUTCH-1733), and since
Neko isn't a dead project, it may support HTML5 in the future.

Tika (and also parse-tika) seems to successfully parse your example.
Are there any problems with Tika? It would be worth including it
in any comparison.
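
For such a comparison, one crude check could be to count tag-like residue
in the text each parser returns. A minimal sketch in plain Java, using
nothing from Nutch: the class TagResidue and the sample strings are made up
for illustration, and the regex is only a heuristic, so it can also flag
legitimate text such as "x<y>".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: counts tag-like residue in parser output.
class TagResidue {
    // Heuristic: matches strings shaped like an opening or closing HTML tag,
    // e.g. <td>, </font>, <br/>, <a href="x">.
    private static final Pattern TAG_LIKE =
        Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(\\s[^<>]*)?/?>");

    // Returns how many tag-like substrings appear in the extracted text.
    static int count(String parsedText) {
        Matcher m = TAG_LIKE.matcher(parsedText);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Made-up sample outputs standing in for the text two parsers return.
        String fromParserA = "Previous <td><font size=\"2\">Next</font></td>";
        String fromParserB = "Previous Next";
        System.out.println("parser A residue: " + count(fromParserA));
        System.out.println("parser B residue: " + count(fromParserB));
    }
}
```

Running the same check over the parsed text from Neko, TagSoup, Tika, Jsoup
and Gumbo for a fixed set of URLs would give a rough number to compare.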

If really needed, I see no problem with adding a new parser implementation
to parse-html. Of course, the license must be compatible (in the case of
Jsoup: MIT). And with Gumbo we would have to include a native library.

If the parser follows the standard, you "sleep better". But there
are many other things in Nutch which are worth taking care of.
If you need it, go ahead and try to integrate one of the
parsers that conform to the HTML5/WHATWG standard.

Sebastian


On 05/05/2014 02:02 PM, Talat Uyarer wrote:
> Hi Lewis and Sebastian,
> 
> First of all, thanks for the reply :) There isn't any issue in our Jira.
> But I found a lot of websites whose parsed text contains HTML tags.
> 
> For example 
> http://www.dersimiz.com/kisa-ilginc-enteresan-tuhaf-acayip-sasirtici-bilgiler.asp#.U2c6H3V_t2M
> 
> When it is parsed by Neko, its parsed text contains HTML tags
> (http://paste.apache.org/2afD). However, when you parse it with Gumbo or
> Jsoup, it is parsed correctly (http://paste.apache.org/7FrE).
> Moreover, I searched Google for some text from the unparsed part of
> this page, using the query
> "site:http://www.dersimiz.com/kisa-ilginc-enteresan-tuhaf-acayip-sasirtici-bilgiler.asp
> ÖNCEKİ" (http://tinyurl.com/l33vg5j). I don't see any
> HTML tags in the snippet. Because of the wrong HTML tag usage, TagSoup or
> Neko didn't build the DOM tree.
> 
> IMHO we can use the existing parse-html plugin. We can add a different
> implementation backend, as with Neko and TagSoup. Neko and TagSoup are
> good parsers, but they focus on HTML4, and the Web is changing. An HTML5
> parser can also cover HTML4 structure. Gumbo is developed by Google
> (https://github.com/google/gumbo-parser/). Jsoup and Gumbo implement
> the WHATWG specification
> (http://www.whatwg.org/specs/web-apps/current-work/multipage/). IMHO we
> can build a better parser with these. I can implement this, but I want
> to learn what our expectations for the default parser are.
> 
> Talat
> 
>>> The parser currently used by the plugin, NekoHTML, doesn't parse correctly.
>>
>>
>> What is wrong with it? Are there any issues in Jira to back this up?
>>
>>>
>>> When I tested it
>>> on a huge website, it left HTML tags in the text.
>>
>>
>> Pretty vague. Anything else? Any more details? Can this be implemented in
>> existing parser plugins?
>>
>>>
>>> IMHO our parser is a little
>>> bit old.
>>
>>
>> Which one? Is it possible to upgrade? I don't know which parser you mean.
>>
>>>
>>> After doing some research, I found the Jsoup[1] and Gumbo[2]
>>> parsers. I did some tests on broken websites. I saw that Gumbo and Jsoup
>>> parse very similarly to Google's parser.
>>>
>> So what are the benefits? If we have a clear-cut argument then let's go for
>> it. If not then maybe your time would be better invested elsewhere. It's up
>> to you I suppose :)
>>
> 
> 
> 
