Hi Talat,

parse-html uses NekoHTML by default, with TagSoup as an alternative. TagSoup is also used by parse-tika. Which parser library parse-html uses internally can be set via the property "parser.html.impl".

It would not hurt to have more libraries available (provided they are compatible, also regarding the license). If one of them really performs better (in quality and speed), we can change the default. But I don't expect a clear-cut result: one library may be faster, another more robust, a third may handle HTML5 well, etc.
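
For illustration, switching parse-html from NekoHTML to TagSoup would look roughly like this. It is a sketch only: it assumes the override goes into conf/nutch-site.xml and that "neko" and "tagsoup" are the accepted keywords, so please check nutch-default.xml in your version.

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<property>
  <name>parser.html.impl</name>
  <!-- assumed keywords: "neko" (the default) or "tagsoup" -->
  <value>tagsoup</value>
</property>
```

That makes it easy to run the same set of broken pages through both parsers and compare the output before we look at adding a third library.
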
What do you mean by "Google's parser"?

Sebastian

On 05/03/2014 01:25 AM, Talat Uyarer wrote:
> Hi all,
>
> Now used parser plugins nekohtml doesnt parse correctly. When I tested
> in huge website site, it leaves html tags. IMHO our parser is little
> bit old. After doing some research, I found Jsoup[1] and Gumbo[2]
> parser. I did some test on broken websites. I saw gumbo and jsoup
> parsed very similar Google's parser.
>
> Wdyt ?
>
> [1] http://jsoup.org/
> [2] https://github.com/google/gumbo-parser

