[GitHub] [commons-text] ameyjadiye commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham

GitBox Fri, 08 Mar 2019 22:49:12 -0800

ameyjadiye commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity 
algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-471152055
 
 
   @kinow @aherbert 
   I wrote the code using Wikipedia as the standard, and looked at libraries 
in other languages for reference. All of them use bigrams to calculate the 
similarity. I personally think that if we go further and use trigrams, 
4-grams ... n-grams, the resulting percentage would be incorrect. We also can't 
use character-by-character matching, i.e. unigrams, because that ignores 
character order and gives bad results: ```ab != ba```, yet both strings have 
the same unigrams.
   
   With n-grams, are we tampering with the existing, proven algorithm? If it 
gives better results than the existing algorithm, I'm OK with that. Also, does 
anyone really need it in real-world use?
   
   Wikipedia says:
   
   > When taken as a string similarity measure, the coefficient may be 
calculated for two strings, x and y using bigrams as follows:[9]
   > 
   >  s = (2 · nt) / (nx + ny)
   > where nt is the number of character bigrams found in both strings, nx is 
the number of bigrams in string x and ny is the number of bigrams in string y. 
For example, to calculate the similarity between:
   > 
   > night
   > nacht
   > We would find the set of bigrams in each word:
   > 
   > {ni,ig,gh,ht}
   > {na,ac,ch,ht}
   > Each set has four elements, and the intersection of these two sets has 
only one element: ht.
   > 
   > Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 
4) = 0.25.
   > 
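
To make the quoted formula concrete, here is a minimal Java sketch of it. This is only an illustration, not the code in this PR: the class name `DiceSketch`, the extra `n` parameter, the use of sets of distinct n-grams (as in the Wikipedia example), and the handling of strings shorter than `n` are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch class; not the implementation proposed in this PR.
final class DiceSketch {

    // Collects the distinct character n-grams of s (empty set if s is shorter than n).
    static Set<String> nGrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // s = (2 * nt) / (nx + ny), computed over sets of n-grams.
    static double similarity(String x, String y, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        Set<String> gx = nGrams(x, n);
        Set<String> gy = nGrams(y, n);
        if (gx.isEmpty() && gy.isEmpty()) {
            // Assumption: edge-case convention chosen only for this sketch.
            return x.equals(y) ? 1.0 : 0.0;
        }
        Set<String> common = new HashSet<>(gx);
        common.retainAll(gy); // nt: n-grams found in both strings
        return (2.0 * common.size()) / (gx.size() + gy.size());
    }

    public static void main(String[] args) {
        System.out.println(similarity("night", "nacht", 2)); // 0.25, as in the Wikipedia example
        System.out.println(similarity("ab", "ba", 1));       // 1.0: unigrams ignore order
        System.out.println(similarity("ab", "ba", 2));       // 0.0: bigrams keep order
    }
}
```

With `n = 2` this reproduces the night/nacht result of 0.25, and with `n = 1` it scores `ab` and `ba` as identical, which is the unigram problem mentioned above.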
   
   I'm not sure, but are we over-engineering the similarities with #109? Let me 
know if there is a practical use for n-grams in the real world; I would like to 
study it more.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
