2009/8/10 Marija Šljivović <[email protected]>:
> Hi
>
> For some time I have been working on an implementation of a Koders [1]
> search parser.
> The aim is a working and maintainable version of the Koders engine parser.
> It must be able to load the koders.com web page, enter a query,
> parse the retrieved result page, and extract useful information from
> it.
> Sometimes there can be more than one page of results to check. This
> code must be easy to maintain and change (if the site changes).
>
> The same must be done for Krugle code search [2].
> Krugle has another option called "advanced search". It can
> be used to search for large pieces of code.
>
> With Google Code Search it is easy because Google has a library to
> access that service.
>
> After some research I found a library that gives us the ability to
> access these sites.
> This tool is HTMLUnit [3].
> It is a "GUI-Less browser for Java programs". It provides an API to access
> any interesting information on a web page, even if it has a lot of
> JavaScript.
> With it I can already parse the Koders code search result page and read
> code from Google Code Search (GWT is supported), which cannot
> normally be retrieved by the gdata-codesearch API. The gdata-codesearch
> API does not support retrieving the language of a search result, but
> with HTMLUnit it is possible.

I do not completely understand which problem this library is meant to
solve, or what kind of GWT support is being discussed.

Is the main problem fetching the full file from Google Code Search? Is
the goal to later parse the entire file and analyse it using more
heavyweight heuristics than regexps? What kind of heuristics are they?

My first concern is obviously the size of the dependency: the example
zip archive was about 7 MB, which is probably too much :) Another
concern is that waiting for JavaScript to be interpreted in Java is a
very unreliable process.

So, to me it seems that fetching koders.com pages and parsing the HTML
(probably with XPath) is a more convenient way. Downloading original
files from koders.com does not require executing any JavaScript at all.
For Google Code Search we can often live with the snippets returned by
the gdata API.
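
To illustrate what I mean by XPath over a fetched page, here is a
minimal sketch using only the JDK (javax.xml.xpath), with no extra
dependency. The page layout, the XPath expression and the class name
ResultPageSketch are my assumptions for illustration, not the real
koders.com markup. It also assumes well-formed markup; real pages would
probably need an HTML-tolerant parser in front of this step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ResultPageSketch {

    // Hypothetical layout: every hit is <div class="result"><a href="...">..</a></div>.
    // If the site changes, only this expression should need updating.
    static final String RESULT_LINKS = "//div[@class='result']/a/@href";

    // Returns the href of every result link found in a well-formed page.
    static List<String> extractLinks(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    RESULT_LINKS, doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                links.add(nodes.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            // Sketch only: collapse parser and XPath failures into one unchecked error.
            throw new IllegalStateException("could not parse result page", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; the paths are placeholders, not real URLs.
        String page = "<html><body>"
                + "<div class=\"result\"><a href=\"/files/Foo.java\">Foo.java</a></div>"
                + "<div class=\"ad\"><a href=\"/promo\">ignored</a></div>"
                + "<div class=\"result\"><a href=\"/files/Bar.java\">Bar.java</a></div>"
                + "</body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

The only site-specific piece here is the XPath string, so a layout
change on the site should mean editing one constant rather than the
parsing code.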

To me this is a case where a heavyweight 'aimed-to-be-universal'
approach does not win.

-- 
Egor Pasko
