Re: [emf-dev] EMF Compare Name Similarity

Cédric Brun Tue, 09 Jul 2013 00:04:36 -0700

Hi,

I'll start with a bit of history.

The original submission of EMF Compare was in fact made of two different products, one from Intalio and one from Obeo. Both had a comparison engine and a UI; the one from Obeo had the more advanced engine, whereas the one from Intalio was more advanced from a UI perspective. [1] refers to the Intalio product (it was actually published before EMF Compare was created). The presentation of this work during the Modeling Symposium in 2006 led to the creation of the project in 2007. At that time it was decided to keep Obeo's engine (which relied on the Dice coefficient) and Intalio's UI. The Levenshtein distance was used by Intalio's engine.

EMF Compare 1.3 is indeed similar to [4] and leverages the Dice coefficient quite a lot. The matching strategy is quite different in EMF Compare 2.x, but it still uses the Dice coefficient.



In a nutshell:

> I) EMF Compare 1.x and 2.x use the Dice coefficient with bi-grams for string similarity

That's right
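
For readers unfamiliar with the measure, here is a minimal sketch of a Dice coefficient over character bi-grams. It is not the code from NameSimilarity.java in [2]. The class and method names are hypothetical, and details such as case handling or the treatment of very short strings are assumptions that may differ from what EMF Compare does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of EMF Compare.
final class DiceSketch {
    private DiceSketch() {}

    /** Splits a string into its overlapping two-character substrings (bi-grams). */
    static List<String> bigrams(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    /** Dice coefficient: 2 * |shared bi-grams| / (|bigrams(a)| + |bigrams(b)|). */
    static double dice(String a, String b) {
        if (a.equals(b)) {
            return 1.0; // identical strings, including strings too short to have bi-grams
        }
        List<String> left = bigrams(a);
        List<String> right = bigrams(b);
        if (left.isEmpty() || right.isEmpty()) {
            return 0.0; // assumption: differing strings without bi-grams share nothing
        }
        // Record the total before consuming matches from the right-hand list.
        int total = left.size() + right.size();
        int shared = 0;
        for (String pair : left) {
            // Removing the matched bi-gram ensures each occurrence is counted once.
            if (right.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }
}
```

With this sketch, "night" and "nacht" each have four bi-grams and share only "ht", so the result is 2 * 1 / 8 = 0.25.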

> II) EMF Compare 2.x uses the Longest Common Subsequence to determine changes in multi-references of EObjects

That's right, and it's used for multi-valued attributes too.
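
As an illustration of the idea, here is a textbook dynamic-programming sketch of the Longest Common Subsequence over two lists. It is not the code from DiffUtil.java in [5]. The class and method names are hypothetical, elements are assumed to be non-null and compared with equals(), whereas the real implementation may compare values differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of EMF Compare.
final class LcsSketch {
    private LcsSketch() {}

    /** Returns one longest common subsequence of the two lists (elements assumed non-null). */
    static <T> List<T> longestCommonSubsequence(List<T> a, List<T> b) {
        // len[i][j] holds the LCS length of the suffixes a[i..] and b[j..].
        int[][] len = new int[a.size() + 1][b.size() + 1];
        for (int i = a.size() - 1; i >= 0; i--) {
            for (int j = b.size() - 1; j >= 0; j--) {
                if (a.get(i).equals(b.get(j))) {
                    len[i][j] = len[i + 1][j + 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i + 1][j], len[i][j + 1]);
                }
            }
        }
        // Walk the table from the start to rebuild one subsequence.
        List<T> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i).equals(b.get(j))) {
                result.add(a.get(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }
}
```

Values that appear in both lists but fall outside the common subsequence are candidates for a move, while values present in only one list are candidates for an addition or a deletion. For the lists [A, B, C, D] and [A, C, B, D], this sketch returns [A, C, D], leaving B as the value that changed position.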

> III) a) is wrong/outdated.

It refers to EMF Compare 1.3 (see the URL) and as such is neither wrong nor outdated, but there is no complete description of the 2.x algorithm on the wiki.



On 05/07/2013 14:53, Simon wrote:

Hi,
At the moment I am reverse engineering EMF Compare and I have already read a lot of material. I think I found some inconsistencies in that material and want to ask whether I understand things correctly.
These are the statements in question:
a) According to [1], EMF Compare uses the Levenshtein distance for string similarity.
b) According to [3], EMF Compare 1.3 is similar to [4]. In [4] the Dice coefficient (although it is not named explicitly) is used for string similarity.
After a code review of [2] and [5], I came to the following conclusions:
I) EMF Compare 1.x and 2.x use the Dice coefficient with bi-grams for string similarity
II) EMF Compare 2.x uses the Longest Common Subsequence to determine changes in multi-references of EObjects
III) a) is wrong/outdated.

I would appreciate it if someone could confirm my conclusions.




References:
[1] http://eclipsesummit.org/summiteurope2006/presentations/ESE2006-EclipseModelingSymposium10_EMFCompareUtility.pdf
[2] http://git.eclipse.org/c/emfcompare/org.eclipse.emf.compare.git/tree/plugins/org.eclipse.emf.compare.match/src/org/eclipse/emf/compare/match/internal/statistic/NameSimilarity.java?h=1.3
[3] http://wiki.eclipse.org/EMF_Compare/FAQ/1.3#What_kind_of_.22strategies.22_use_EMF_compare_.3F
[4] http://ase.cs.uni-due.de/olbib/p54-xing-241.pdf
[5] http://git.eclipse.org/c/emfcompare/org.eclipse.emf.compare.git/tree/plugins/org.eclipse.emf.compare/src/org/eclipse/emf/compare/utils/DiffUtil.java?h=2.1
_______________________________________________
emf-dev mailing list
[email protected]
https://dev.eclipse.org/mailman/listinfo/emf-dev

