Hi,

For some time I have been working on an implementation of a Koders [1] search parser. The aim is a working, maintainable parser for the Koders engine. It must be able to load the koders.com web page, enter a query, parse the retrieved result page, and extract useful information from it. Sometimes there is more than one page of results to check. The code must be easy to maintain and to change if the site changes.
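To illustrate the multi-page and maintainability requirements, here is a minimal sketch in plain Java. It keeps the site-specific parsing behind one small interface, so only that part needs to change when a site changes. All names (PagedSearchSketch, ResultPageParser, DemoParser, collectAll) and the HTML markup are hypothetical and invented for illustration; this is not the sample from the project page, it does not use the real koders.com markup, and page fetching is stubbed with an in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PagedSearchSketch {

    /** One extracted search hit: file name plus the language shown for it. */
    public static final class Hit {
        public final String file;
        public final String language;

        public Hit(String file, String language) {
            this.file = file;
            this.language = language;
        }

        @Override
        public String toString() {
            return file + " (" + language + ")";
        }
    }

    /** Site-specific part (hypothetical): one implementation per search engine. */
    public interface ResultPageParser {
        List<Hit> parseHits(String html);

        /** Returns the URL of the next result page, or null on the last page. */
        String nextPageUrl(String html);
    }

    /** Parser for an invented markup; an assumption, not any real site's HTML. */
    public static final class DemoParser implements ResultPageParser {
        private static final Pattern HIT =
            Pattern.compile("<li class=\"hit\" data-lang=\"([^\"]*)\">([^<]*)</li>");
        private static final Pattern NEXT =
            Pattern.compile("<a rel=\"next\" href=\"([^\"]*)\">");

        @Override
        public List<Hit> parseHits(String html) {
            List<Hit> hits = new ArrayList<>();
            Matcher m = HIT.matcher(html);
            while (m.find()) {
                hits.add(new Hit(m.group(2), m.group(1)));
            }
            return hits;
        }

        @Override
        public String nextPageUrl(String html) {
            Matcher m = NEXT.matcher(html);
            return m.find() ? m.group(1) : null;
        }
    }

    /** Follows "next" links from firstUrl, stopping after maxPages pages. */
    public static List<Hit> collectAll(String firstUrl,
                                       Function<String, String> fetcher,
                                       ResultPageParser parser,
                                       int maxPages) {
        List<Hit> hits = new ArrayList<>();
        String url = firstUrl;
        for (int page = 0; url != null && page < maxPages; page++) {
            String html = fetcher.apply(url);
            hits.addAll(parser.parseHits(html));
            url = parser.nextPageUrl(html);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two fake result pages standing in for a real site.
        Map<String, String> fakeSite = Map.of(
            "page1", "<li class=\"hit\" data-lang=\"Java\">Foo.java</li>"
                   + "<a rel=\"next\" href=\"page2\">next</a>",
            "page2", "<li class=\"hit\" data-lang=\"C\">bar.c</li>");
        for (Hit hit : collectAll("page1", fakeSite::get, new DemoParser(), 10)) {
            System.out.println(hit);
        }
    }
}
```

In a real parser the fetcher would be whatever loads the page, and the regular expressions would be replaced by proper HTML access; the maxPages limit is only a guard against following result links forever.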
A parser with the same abilities is needed for Krugle code search [2]. Krugle has an additional option called "advanced search", which can be used to search for large parts of code. Google Code Search is easier, because Google has a library for accessing that service.

After some research I found a library that gives us access to these sites: HTMLUnit [3]. It is a "GUI-Less browser for Java programs". It provides an API for reaching any interesting information on a web page, even if the page uses a lot of JavaScript. With it I can already parse the Koders code search result page and read code from Google Code Search (GWT is supported) that cannot be retrieved through the regular gdata-codesearch API. The gdata-codesearch API also cannot return the language of a search result, but with HTMLUnit that is possible. I have not found any other library that parses GWT pages (Google Code Search) and other JavaScript pages as successfully.

HTMLUnit is licensed under the Apache 2.0 license and is already mavenised. The only disadvantages of using it in our code are its many project dependencies and its name: although it is mainly used for testing, it also works very well for retrieving information from the web.

So I believe that using this library will let us work with all three parsers in a common way. What is your opinion about using HTMLUnit in the apache-rat-pd project? The apache-rat-pd project page has a sample that uses HTMLUnit to parse the Koders code search engine [4].

Best regards,
Marija

[1] http://www.koders.com/
[2] http://www.krugle.com/
[3] http://htmlunit.sourceforge.net/
[4] http://code.google.com/p/apache-rat-pd/downloads/list
