Hi, I did a bit of searching on code search engines (Google Code Search, Krugle and Koders) to find a scalable solution. As a first step we can set an initial size for the sliding window (the user can change this size). When a long string is sent to a search engine, it is tokenized before searching. As I understand it, there is a limit on the number of tokens they create: if the query string is too long, then after a certain number of tokens the rest of the string is treated as a single token. If we can find out this limit, it is better to use it as the window length (so that the window contains that many tokens). Let’s say this size is n. If a query fails to find any result, all n tokens are dropped, the next n tokens are loaded, and the search is performed again. If a query does return results, those URLs are recorded (I think it’s better to keep only the first 3 or 4 URLs). Even when a query returns results, the window still moves on to the next n tokens. This way the whole code can be searched much more quickly while conserving search engine resources. Once the list of URLs has been prepared, an in-depth search can be performed.
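To make the idea concrete, here is a rough sketch in Python. It is only an illustration under some assumptions: `search` is a hypothetical callable standing in for whichever engine we query (it takes a query string and returns URLs), `windowed_search` and its parameter names are made up for this example, and skipping URLs that were already recorded is my own addition rather than part of the proposal above.

```python
from typing import Callable, Iterable, List


def windowed_search(
    tokens: List[str],
    search: Callable[[str], Iterable[str]],  # hypothetical engine wrapper
    window_size: int,
    max_urls_per_query: int = 3,
) -> List[str]:
    """Collect candidate URLs by querying non-overlapping windows of n tokens."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")

    candidates: List[str] = []
    seen = set()  # assumption: record each URL only once

    # Move forward by a full window every time, whether or not the query matched.
    for start in range(0, len(tokens), window_size):
        query = " ".join(tokens[start:start + window_size])
        # Keep only the first few URLs returned for this window.
        results = list(search(query))[:max_urls_per_query]
        for url in results:
            if url not in seen:
                seen.add(url)
                candidates.append(url)

    return candidates
```

The returned list would then be the input for the in-depth search. The window size n is left as a parameter because the real token limit of each engine still has to be found out.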
I’d like to hear your comments on this method. Best Regards, Amila
