Hi,
I'll start with a bit of history.
The original submission of EMF Compare was in fact made of two different
products, one from Intalio and another one from Obeo. Both had a
comparison engine and a UI; the one from Obeo was more advanced
regarding the engine, whereas the one from Intalio was more advanced from
a UI perspective. [1] refers to the Intalio product (it was actually
published before EMF Compare got created). The presentation of this work
during the Modeling Symposium in 2006 led to the creation of the project
in 2007. At that time it was decided to keep Obeo's engine (which
relied on the Dice coefficient) and Intalio's UI. The Levenshtein
distance was used by Intalio's engine.
EMF Compare 1.3 is indeed similar to [4] and leverages the Dice
coefficient quite a lot. The matching strategy is quite different in EMF
Compare 2.x, but it still uses the Dice coefficient.
In a nutshell:
> I) EMF Compare 1.x and 2.x use the Dice coefficient with bi-grams for
> string similarity
That's right
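To make the measure concrete, here is a minimal sketch of a Dice coefficient computed over bi-grams. It is not the code from NameSimilarity.java; the class and method names (DiceSketch, bigrams, dice) are made up for illustration, and details such as case handling or the treatment of very short strings are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class DiceSketch {
    // Collect every pair of adjacent characters (bi-grams) of a string.
    static List<String> bigrams(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < s.length() - 1; i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    // Dice coefficient: 2 * shared bi-grams / (bi-grams of a + bi-grams of b).
    // Assumption: identical strings score 1.0 and strings without any
    // bi-gram (length < 2) score 0.0 against a different string.
    static double dice(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        List<String> first = bigrams(a);
        List<String> second = bigrams(b);
        if (first.isEmpty() || second.isEmpty()) {
            return 0.0;
        }
        int total = first.size() + second.size();
        int shared = 0;
        for (String pair : first) {
            // Remove the matched bi-gram so duplicates are not counted twice.
            if (second.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    public static void main(String[] args) {
        // "night" and "nacht" share only the bi-gram "ht": 2 * 1 / (4 + 4)
        System.out.println(dice("night", "nacht")); // prints 0.25
    }
}
```

The result lies between 0 (no bi-gram in common) and 1 (identical strings), which makes it easy to compare against a threshold when matching names.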
> II) EMF Compare 2.x uses the Longest Common Subsequence to determine
> changes in multi-references of EObjects
That's right, and it's used for multi-valued attributes too.
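For readers unfamiliar with it, here is a minimal sketch of a Longest Common Subsequence computation on two lists, using the textbook dynamic-programming approach. It is not the code from DiffUtil.java; the names (LcsSketch, lcs) are made up for illustration, and plain equals() stands in for whatever element matching the real implementation performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LcsSketch {
    // Longest common subsequence of two lists, via dynamic programming.
    static <T> List<T> lcs(List<T> a, List<T> b) {
        // len[i][j] = LCS length of a[i..] and b[j..]
        int[][] len = new int[a.size() + 1][b.size() + 1];
        for (int i = a.size() - 1; i >= 0; i--) {
            for (int j = b.size() - 1; j >= 0; j--) {
                if (a.get(i).equals(b.get(j))) {
                    len[i][j] = len[i + 1][j + 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i + 1][j], len[i][j + 1]);
                }
            }
        }
        // Walk the table from the start to rebuild one LCS.
        List<T> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i).equals(b.get(j))) {
                result.add(a.get(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("a", "b", "c", "d");
        List<String> after = Arrays.asList("a", "c", "d", "e");
        System.out.println(lcs(before, after)); // prints [a, c, d]
    }
}
```

In this example the values in the LCS ("a", "c", "d") kept their relative order, so only "b" and "e" are left as candidate changes (here a deletion and an addition).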
> III) a) is wrong/outdated.
It refers to EMF Compare 1.3 (see the URL) and as such is neither wrong
nor outdated. However, there is no complete description of the 2.x
algorithm on the wiki.
On 05/07/2013 14:53, Simon wrote:
Hi,
at the moment I am reverse engineering EMF Compare and I've already
read a lot of material. I think I found some inconsistencies in the
material and want to ask whether I understand things correctly.
These are the statements in question:
a) According to [1] EMF Compare uses Levenshtein distance for string
similarity.
b) According to [3] EMF Compare 1.3 is similar to [4]. In [4] the Dice
coefficient (although it is not named explicitly) is used for string
similarity.
After a code review of [2] and [5], I came to the following conclusions:
I) EMF Compare 1.x and 2.x use the Dice coefficient with bi-grams for
string similarity
II) EMF Compare 2.x uses the Longest Common Subsequence to determine
changes in multi-references of EObjects
III) a) is wrong/outdated.
I would appreciate it if someone could confirm my conclusions.
References:
[1] http://eclipsesummit.org/summiteurope2006/presentations/ESE2006-EclipseModelingSymposium10_EMFCompareUtility.pdf
[2] http://git.eclipse.org/c/emfcompare/org.eclipse.emf.compare.git/tree/plugins/org.eclipse.emf.compare.match/src/org/eclipse/emf/compare/match/internal/statistic/NameSimilarity.java?h=1.3
[3] http://wiki.eclipse.org/EMF_Compare/FAQ/1.3#What_kind_of_.22strategies.22_use_EMF_compare_.3F
[4] http://ase.cs.uni-due.de/olbib/p54-xing-241.pdf
[5] http://git.eclipse.org/c/emfcompare/org.eclipse.emf.compare.git/tree/plugins/org.eclipse.emf.compare/src/org/eclipse/emf/compare/utils/DiffUtil.java?h=2.1
_______________________________________________
emf-dev mailing list
[email protected]
https://dev.eclipse.org/mailman/listinfo/emf-dev