Amila, thanks.
In addition to the proposal writing skill, I encourage you to
demonstrate your coding skills to those who will vote for your
acceptance as a GSoC participant. For example, you could write a class
that parses cut&paste detector arguments into some internal
representation. It should contain a main method, a loop to cycle
through the arguments, and a usage message. The latter is actually a
lightweight way to approach the architectural question of the tool's
scope.
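
To make this concrete, here is a minimal sketch of such a class. The
-w flag, its default, and the class name are only my illustration,
not the actual detector's options:

    // Minimal command-line parser sketch for a cut&paste detector.
    // The -w flag and its default are illustrative assumptions.
    public class DetectorOptions {
        int windowSize = 100;   // sliding window size in tokens
        String path;            // source tree to scan

        static void usage() {
            System.err.println("usage: detector [-w window-size] <source-dir>");
            System.exit(1);
        }

        public static void main(String[] args) {
            DetectorOptions opts = new DetectorOptions();
            for (int i = 0; i < args.length; i++) {
                if ("-w".equals(args[i])) {
                    if (++i == args.length) usage();
                    opts.windowSize = Integer.parseInt(args[i]);
                } else if (args[i].startsWith("-")) {
                    usage();        // unknown flag
                } else {
                    opts.path = args[i];
                }
            }
            if (opts.path == null) usage();
            System.out.println("scanning " + opts.path
                    + " with a window of " + opts.windowSize + " tokens");
        }
    }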

Thanks.




2009/4/2 Amila De Silva <[email protected]>:
> Hi Alexei,
> Thanks for the reply!
> I'll send my application asap.
> BR,
> Amila
>
>
> On 4/1/09, Alexei Fedotov <[email protected]> wrote:
>> Amila,
>> I'm sorry, I have unintentionally marked your mail as read. Please
>> don't hesitate to ping me again if there is no answer.
>>
>> Your method would do the job. Let me just add that making the
>> sliding window size automatically adjustable would keep the same
>> linear algorithmic complexity, so it might be a worthwhile
>> investment.
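>>
>> For instance, the window could grow on a miss and shrink on a hit,
>> still in a single pass over the token stream. A rough sketch (the
>> doubling/halving policy, the bounds, and query() are my assumptions,
>> not an agreed design):
>>
>>     import java.util.ArrayDeque;
>>     import java.util.Deque;
>>     import java.util.List;
>>
>>     public class AdaptiveWindow {
>>         // Stub for the real search-engine call; returns true on a hit.
>>         static boolean query(Deque<String> window) { return false; }
>>
>>         // One pass over the tokens: each token enters the buffer once,
>>         // so adjusting the window size keeps the algorithm linear.
>>         static void scan(List<String> tokens) {
>>             int size = 64;               // initial size, arbitrary choice
>>             Deque<String> buf = new ArrayDeque<>();
>>             for (String t : tokens) {
>>                 buf.addLast(t);
>>                 if (buf.size() < size) continue;
>>                 size = query(buf) ? Math.max(16, size / 2) : size * 2;
>>                 buf.clear();             // slide on to the next chunk
>>             }
>>         }
>>     }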
>>
>> Please send a proposal to the official GSoC app now.
>>
>> Thanks!
>>
>>
>> On Fri, Mar 27, 2009 at 6:49 PM, Amila De Silva <[email protected]> wrote:
>>> Hi,
>>> I did a bit of searching on code search engines (Google Code Search,
>>> Krugle and Koders) to find a scalable solution.
>>> As a first step we can set an initial size for the sliding window
>>> (this size can be changed by the user).
>>> When a long string is sent to the search engine, it is tokenized
>>> before searching. As I understood, there is a limit on the number of
>>> tokens they create: if the query string is too long, everything after
>>> a certain number of tokens is treated as a single token.
>>> If we can find out this token limit, it is better to use it as the
>>> window length (so that the window holds exactly that many tokens).
>>> Let's say this size is n. If a query fails to find any result, the
>>> whole n tokens will be dropped, the next n tokens will be loaded, and
>>> the search will be performed again.
>>> If the query does return results, those URLs will be recorded (I
>>> think it's better to take only the first 3 or 4 URLs).
>>> Either way, the next n tokens are then loaded fresh.
>>> In this way the whole code base can be searched much more quickly,
>>> preserving search engine resources. After the list of URLs has been
>>> prepared, an in-depth search can be performed.
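>>>
>>> Roughly, in code (only a sketch under those assumptions; query()
>>> stands in for the real engine call and URLs are plain strings):
>>>
>>>     import java.util.ArrayList;
>>>     import java.util.List;
>>>
>>>     public class ChunkedSearch {
>>>         // Stub for the search-engine call; returns matching URLs.
>>>         static List<String> query(List<String> chunk) {
>>>             return new ArrayList<>();
>>>         }
>>>
>>>         // Split the token stream into chunks of n tokens, search each
>>>         // chunk, and record at most the first 3 URLs per hit. A
>>>         // trailing partial chunk is ignored for simplicity.
>>>         static List<String> search(List<String> tokens, int n) {
>>>             List<String> urls = new ArrayList<>();
>>>             for (int i = 0; i + n <= tokens.size(); i += n) {
>>>                 List<String> hits = query(tokens.subList(i, i + n));
>>>                 urls.addAll(hits.subList(0, Math.min(3, hits.size())));
>>>             }
>>>             return urls;
>>>         }
>>>     }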
>>>
>>> I'd like to hear your comments on this method.
>>>
>>> Best Regards,
>>> Amila
>>>
>>
>>
>>
>> --
>> With best regards / с наилучшими пожеланиями,
>> Alexei Fedotov / Алексей Федотов,
>> http://www.telecom-express.ru/
>> http://people.apache.org/~aaf/
>>
>



-- 
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/
