Re: Folder Comparison with Percentage Similarity?

Alan Halls Thu, 28 Sep 2017 09:01:09 -0700

Thanks Jag! I will certainly look into Levenshtein!
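
To make the Levenshtein idea concrete, here is a minimal sketch of the character distance Jag describes in the quoted message, turned into a percentage. The function names `levenshtein` and `similarity_percent` are hypothetical, and dividing by the longer text's length is just one common way to normalise, not the only one.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn a into b (classic two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumption: normalise the distance by the length of the longer text."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty texts are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

The running time grows with the product of the two text lengths, so for large files it is probably more practical to apply this per line or per function than to whole files at once.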

I found this tool (https://www.safe-corp.com/products_codematch.htm),
but it costs up to $400/MB
(https://www.safe-corp.com/documents/CodeSuite%20pricing.pdf). This seemed
like something Meld would be perfect for with minimal effort, and it could
attract a whole new group of power users, maybe even some with funding
behind them to improve Meld.


I have a part-time .NET programmer coming by this afternoon whom I may have
look at extracting those stats, but I'm not sure how realistic that is as an
afternoon project for someone not familiar with the code base.
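
As a rough idea of what extracting those stats could look like outside Meld, here is a sketch that uses Python's standard `difflib` module. It is not Meld's own code. The function names, the `*.php` pattern, and the report keys are all hypothetical, and it assumes that only files sharing the same relative path in both trees are compared, so renamed or moved files are skipped.

```python
from __future__ import annotations

import difflib
from pathlib import Path


def read_lines(path: Path) -> list[str]:
    """Read a file as whitespace-stripped lines, dropping blank ones."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def compare_folders(source: Path, destination: Path, pattern: str = "*.php",
                    run_lengths: tuple[int, ...] = (5, 25)) -> dict:
    """Build a line-level report for files present in both trees.

    Assumption: files are paired by identical relative path only.
    """
    report = {"files_compared": 0, "source_lines": 0, "destination_lines": 0,
              "matching_lines": 0, "runs": {n: 0 for n in run_lengths}}
    for src_file in sorted(source.rglob(pattern)):
        dst_file = destination / src_file.relative_to(source)
        if not src_file.is_file() or not dst_file.is_file():
            continue
        src, dst = read_lines(src_file), read_lines(dst_file)
        matcher = difflib.SequenceMatcher(None, src, dst, autojunk=False)
        report["files_compared"] += 1
        report["source_lines"] += len(src)
        report["destination_lines"] += len(dst)
        # Each matching block is a run of identical contiguous lines.
        for block in matcher.get_matching_blocks():
            report["matching_lines"] += block.size
            for n in run_lengths:
                if block.size > n:  # "greater than n contiguous lines"
                    report["runs"][n] += 1
    total = report["source_lines"] + report["destination_lines"]
    # Same ratio difflib uses: 2 * matches / total lines on both sides.
    report["similarity_percent"] = (
        100.0 * 2 * report["matching_lines"] / total if total else 100.0)
    return report
```

Calling `compare_folders(Path("mine"), Path("his"))` with two placeholder directory names returns a dictionary that could be printed in the style of the debug report quoted further down.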

Alan

On Thu, Sep 28, 2017 at 9:33 AM, Jaggz H <[email protected]> wrote:

> Halls,
>
> 1. You might do well to write some code of your own, if you can,
> possibly combining shell tools and scripting. I'd recommend this,
> assuming you're the one in the right :), because you'll be able to get the
> custom stats you need to strengthen your case, without being limited to
> someone else's tools.
> 2. That being said, maybe a few stats would be useful to some Meld
> users. I wonder if kdiff3 outputs stats. kdiff3 is another GUI diff-merge
> tool. I use both Meld and kdiff3.
> 3. Also, maybe look into the Levenshtein text difference algorithm. In
> Perl I use Text::Levenshtein (_XS). It provides a character distance
> between two texts (i.e. how many single-character edits are needed to make
> one into the other), which then readily translates to a percentage. In that
> respect, it relates more directly to the amount of change than line counts do.
>
> Jag
>
> On Sep 28, 2017 7:09 AM, "Alan Halls" <[email protected]> wrote:
>
>> Thanks Phil for the response, I guess I was thinking of a debug report
>> such as:
>> Files analyzed: 19,543
>> Folders analyzed: 343
>> Total lines of code analyzed: 1,544,346
>> Total lines of code in source: 1,244,346
>> Total lines of code in destination: 1,944,346
>> Total lines with exact matches: 856,644
>> Unique lines in source: 400,546
>> Unique lines in destination: 850,546
>> Similarity of source to destination: 45%
>> Exact matches of greater than 25 contiguous lines of code: 943
>> Exact matches of greater than 5 contiguous lines of code: 46,733
>>
>> I looked into the plagiarism-detector tools and haven't found anything
>> yet that handles PHP. The command-line diff tools "should" be able to
>> output this type of report; I just figured that all of this info, with the
>> exception of the last 2 items, would already be tracked in the software and
>> would just need to be output somewhere.
>>
>> Alan
>>
>> On Wed, Sep 27, 2017 at 4:14 PM, Phil Hord <[email protected]> wrote:
>>
>>> Alan,
>>>
>>> Tools already exist that more directly meet your need.  Any unix-like
>>> system will have command-line tools to do most of this analysis.  I'd start
>>> with "diff -b -B -w", but you can also use "comm".  The comm tool relies on
>>> the files being sorted, though, so you might want to ignore "empty" lines
>>> or common lines like </head>, for example.
>>>
>>> There are some plagiarism-detector tools that may also help, but I don't
>>> have any experience with those.
>>>
>>> Feel free to contact me off-list if you need more specific guidance.
>>> Phil
>>>
>>>
>>> On Wed, Sep 27, 2017 at 2:49 PM Alan Halls <[email protected]> wrote:
>>>
>>>> I am involved in a legal matter regarding an employee's theft of trade
>>>> secrets. In particular, he stole the source code for a website that he and 2
>>>> other programmers worked on for 2 years.
>>>>
>>>> I now have a copy of his project, and of course a copy of mine. I found
>>>> the software Meld, which seems to do a great job on a one-by-one basis, but
>>>> it would be very time-consuming to try to end up with any "score" of how
>>>> much of our original code is still in his existing project.
>>>>
>>>> He was sloppy, and his launched public website still has our company
>>>> info in the 404 page, which links to the about us, pricing, docs, and
>>>> contact us pages, which all still have the original code in them. So
>>>> there is no question about whether or not he took it, just how much
>>>> "custom" work he did for himself.
>>>>
>>>> I was kind of imagining a report with a total score, then the top 50
>>>> matches with each of their scores. Has anyone thought of adding that in? It
>>>> seems that all that info would be available already in the program, just
>>>> needing a view for it to display on.
>>>>
>>>> _______________________________________________
>>>> meld-list mailing list
>>>> [email protected]
>>>> https://mail.gnome.org/mailman/listinfo/meld-list
>>>
>>>
>>
>>
>

