Hi all,

I just wanted to share some of my notes about HTMLUnit.

> I do not completely understand the problem being solved using this
> library, and what kind of GWT support is discussed.

The aim is to make apache-rat-pd work with the Koders.com, Google Code
Search and Krugle.com code search engines.
Unfortunately, only Google Code Search provides a library with an API
for using the engine programmatically; Koders and Krugle provide
nothing like that. The Google Code Search engine is very powerful and
has great regex support, but the gdata-codesearch API is missing two
things: a way to get the source file and a way to get the language of
the source file.
I found that HTMLUnit can easily do both of these missing things.
GWT support matters only for Google Code Search; almost all of
Google's sites are written using GWT, Google Code Search included.
HTMLUnit works great with Google Code Search. I was wondering if there
are other libraries which can handle it as well.
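For illustration, here is a minimal sketch of how HTMLUnit could fetch a result page with JavaScript enabled. The URL is a placeholder, not a real result-page address, and this uses the plain 2.x-era `WebClient` setters; treat it as a sketch, not the final parser code:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FetchExample {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        // GWT-heavy pages need JavaScript turned on; script errors on
        // third-party pages are common, so don't let them abort the fetch.
        client.setJavaScriptEnabled(true);
        client.setThrowExceptionOnScriptError(false);

        // Placeholder URL for illustration only.
        HtmlPage page = client.getPage("http://www.google.com/codesearch");
        System.out.println(page.getTitleText());
        System.out.println(page.asText()); // page content as plain text

        client.closeAllWindows();
    }
}
```

From the `HtmlPage` we could then pull out the source text and the language label that the gdata-codesearch API does not expose.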

> Is this to later parse the entire file and analyse it using more
> heavyweight heuristics than regexps? What kind of heuristics are they?

If we have the source file loaded in our application, we can do any
heuristic check we can imagine.
Right now, we can only ask Google Code Search whether something was
found using a limited regex (and even this is much more freedom than
the other search engines provide). The returned information is a list
of matched code parts (single lines) and a link to a site where we can
see the matched code. If we have the source file available for
post-processing, we will be able to, at the very least, do a
Levenshtein distance analysis and check whether only the names of
identifiers were changed. We cannot do that now. We could then also
show the matched code in our reporting tool without needing to view it
on the Google Code Search site.
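To make the identifier-renaming check concrete, here is a minimal Levenshtein-distance sketch. The `looksCopied` helper and its threshold are my own assumptions for illustration, not part of apache-rat-pd:

```java
public class RenameCheck {
    // Classic dynamic-programming edit distance between two strings.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Flag two snippets as suspiciously similar when the edit distance
    // is small relative to their length (e.g. only identifiers renamed).
    static boolean looksCopied(String local, String remote, double threshold) {
        int max = Math.max(local.length(), remote.length());
        if (max == 0) return true;
        return (double) levenshtein(local, remote) / max < threshold;
    }
}
```

With whole files fetched via HTMLUnit, this kind of check could run over each matched region instead of a single returned line.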

> My first concern is obviously the size of the dependency, the example
> zip archive was about 7 MB, which is probably too much :) Another
> concern is that waiting for javascript to be interpreted in java is a
> very unreliable process.

I totally agree. 7 MB is a large dependency. I hope that we can
find a better solution :(

I think that using Krugle advanced search will be very difficult
without some library which supports JavaScript. HTMLUnit can do it.
Interpreting JavaScript is a slow process, but it is still much faster
than viewing the page in a web browser :)

So, advantages of HTMLUnit are:

-It can provide all the information we are interested in.
-It supports all three code search engines.
-Code written to scrape data with HTMLUnit is more or less readable
and maintainable.
-HTMLUnit is a stable project and has been very popular in recent years [1].
-It has an Apache license.
-It is already mavenized.

Disadvantages are:

-HTMLUnit is very large.
-It is mainly used to test web pages, not to extract information from them.
-There are probably other difficulties with using it which I haven't
noticed so far...

Because of that, and because it is the tool of choice when people need
a web-site scraping API, I thought that we could use it in our tool.
As for the HTMLUnit size problem: we could make the parsers which use
HTMLUnit optional parts of apache-rat-pd (some pluggable architecture
could be used).
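To make the optional-parser idea concrete, here is a rough sketch of such a pluggable architecture. The interface and registry names are hypothetical, invented for this example:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin interface: each search engine parser is optional.
interface CodeSearchEngine {
    String name();
    // false when the parser's heavy dependency (e.g. HtmlUnit) is absent
    boolean isAvailable();
    List<String> search(String query);
}

// Only available engines are registered; the rest are silently skipped,
// so the core tool works without the optional dependencies.
class EngineRegistry {
    private final List<CodeSearchEngine> engines = new ArrayList<CodeSearchEngine>();

    void register(CodeSearchEngine engine) {
        if (engine.isAvailable()) engines.add(engine);
    }

    List<String> searchAll(String query) {
        List<String> hits = new ArrayList<String>();
        for (CodeSearchEngine e : engines) hits.addAll(e.search(query));
        return hits;
    }
}
```

The Google Code Search parser could then stay a lightweight default, while the HTMLUnit-based Koders and Krugle parsers ship as optional plugins.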

Of course, we can keep searching and eventually find an alternative to
this library, maybe one of these:
http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java
or this: http://sourceforge.net/projects/nekohtml/

Best regards,
Marija

[1]
http://www.expertaya.com/2009/01/23/java-screen-scraping-library/
http://stackoverflow.com/questions/422913/autogenerate-http-screen-scraping-java-code
http://9mmedia.com/blog/?p=321

On Mon, Aug 10, 2009 at 10:18 AM, Egor Pasko<[email protected]> wrote:
> 2009/8/10 Marija Šljivović <[email protected]>:
>> Hi
>>
>> For some time I have been working on an implementation of a Koders [1] search parser.
>> The aim is to make a working and maintainable version of the Koders engine parser.
>> It must be able to load the koders.com web page, enter a query,
>> parse the retrieved result page - and extract useful information from
>> it.
>> Sometimes there can be more than one page of results to check. This
>> code must be easy to maintain and change (if the site changes).
>>
>> The same thing must be done with Krugle code search [2].
>> Krugle has another option called "advanced search". It can
>> be used to search for large code parts.
>>
>> With Google Code Search it is easy because Google has a library to
>> access that service.
>>
>> After some research I found a library which can give us the ability
>> to access these sites.
>> This tool is HTMLUnit. [3]
>> It is a "GUI-Less browser for Java programs". It provides an API to
>> access any interesting information on a web page, even one with a lot
>> of JavaScript.
>> With it I can already parse the Koders code search result page and read
>> code from Google Code Search (GWT is supported), which cannot
>> normally be retrieved via the gdata-codesearch API. The gdata-codesearch
>> API has no support for retrieving the language of a search result, but
>> with HTMLUnit it is possible.
>
> I do not completely understand the problem being solved using this
> library, and what kind of GWT support is discussed.
>
> Is the main problem to fetch full file from google code search? Is
> this to later parse the entire file and analyse it using more
> heavyweight heuristics than regexps? What kind of heuristics are they?
>
> My first concern is obviously the size of the dependency, the example
> zip archive was about 7 MB, which is probably too much :) Another
> concern is that waiting for javascript to be interpreted in java is a
> very unreliable process.
>
> So, for me it seems that fetching coders.com and parsing html (with
> probably xpath) is a more convenient way. To download original files
> from koders.com it is absolutely not required to execute javascript.
> For google code search we may often live with snippets returned by
> gdata API.
>
> To me it is kind of a case where a heavyweight 'aimed-to-be-universal'
> approach does not win.
>
> --
> Egor Pasko
>
