Hi,
I did a bit of searching on code search engines (Google Code Search,
Krugle and Koders) to find a scalable solution.
As the first step, we can set an initial size for the sliding
window (this size can be changed by the user).
When a long string is sent to a search engine, it is tokenized
before searching. As I understand it, there is a limit on the number
of tokens the engine creates; if the query string is too long, then
after a certain number of tokens the rest of the string is treated as
a single token. If we can find out this limit, it is better to use it
as the window length (so that the window contains that many tokens).
Let’s say this size is n. If a query fails to find any result, then
all n tokens will be dropped, the next n tokens will be loaded,
and the search will be performed again.
If the query returns any results, those URLs will be recorded (I think
it’s better to take only the first 3 or 4 URLs).
Even when the query returns results, the window still moves on to the
next n tokens.
This way the whole code can be searched much more quickly, while
conserving search engine resources. After the list of URLs has been
prepared, an in-depth search can be performed.
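
To make the procedure above concrete, here is a rough Python sketch.
It is only an illustration under some assumptions: the names windows,
collect_candidate_urls and the search callback are made up for this
mail, tokens are obtained by simple whitespace splitting (the real
engines may tokenize differently), and search stands in for whatever
call actually queries the engine and returns a list of URLs.

```python
from typing import Callable, Iterator, List


def windows(tokens: List[str], n: int) -> Iterator[List[str]]:
    """Yield consecutive, non-overlapping windows of at most n tokens."""
    if n <= 0:
        raise ValueError("window size must be positive")
    for start in range(0, len(tokens), n):
        yield tokens[start:start + n]


def collect_candidate_urls(
    source: str,
    search: Callable[[str], List[str]],  # hypothetical: query string -> URLs
    n: int = 8,                          # window length; user-adjustable
    max_urls_per_query: int = 3,         # "first 3 or 4 URLs only"
) -> List[str]:
    """Slide over the source n tokens at a time and record the top URLs."""
    tokens = source.split()  # assumption: whitespace tokenization
    candidates: List[str] = []
    for window in windows(tokens, n):
        results = search(" ".join(window))
        # Whether or not the query matched, the loop advances to the
        # next n tokens; only the first few URLs of a match are kept.
        for url in results[:max_urls_per_query]:
            if url not in candidates:
                candidates.append(url)
    return candidates
```

The returned list would then be the input to the in-depth search. The
default n = 8 is just a placeholder; ideally it would be set to the
engine's token limit once we know it.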

I’d like to hear your comments on this method.

Best Regards,
Amila
