You are right: k is the edit distance we are searching within, and it is a critical parameter. In short, k represents how much error (in terms of edit distance) you are willing to tolerate between the document word 'w' and your suggestion. Our data structure can answer queries such as "find all words with k <= 5". Since loading the tree and searching it could be costly, I think that instead of firing queries repeatedly for k = 1, 2, 3, ... it's better to do it like this:

1. For a given document word 'w', start with k = 0 (exact matching, i.e. checking whether w is present in the dictionary). If the returned list.size() == 1, it's a valid word. Otherwise, if the list is empty, fire a query for k = 2. The function returns a list of all dictionary words which are *<=2* distance from 'w', sorted by edit distance. Sometimes the returned list could be large, so you need to filter out the best possible suggestions for 'w'. For example, you might want to prefer words that are 1 edit away over those that are 2 away, and among those, prefer edits where the replaced letter is close to the misspelled one on the keyboard. Take w = REDT: REST is more likely than RENT, as 'S' is closer to 'D' on the keyboard than 'N' is. Or you could prefer words that sound the same, based on some phoneme model, etc.

2. If the list returned for k = 2 was empty, you can query for k = 5 and check whether there are any words with edit distance *<=5*. Again, the returned list could be empty as well. You might want to cap your search on k (say at 5). E.g. if the document contains w = "ijljhflkjjiulgihh", it's highly unlikely that your dictionary will contain any word close to this (unless your dictionary contains crazy volcano names from Iceland), so for cases like these, after k = 5 you can return "No Suggestion".

It's really a matter of experimentation. You could try some other way as well, but this way you can limit the number of fuzzy queries per word to 2 (k = 2 and k = 5, after the exact-match check).

A correction: I realize that previously I used the names 'KDTree' and 'bk-tree' interchangeably. Both are tree structures for nearest-neighbour style searches, but what I really meant was a BK-tree, where a node has an arbitrary number of children and the parent-child edge represents the corresponding Levenshtein distance between them. The basic idea here is to store your dictionary in a data structure which facilitates searching for words based on their edit distances.

On 27 October 2012 22:40, payal gupta <[email protected]> wrote:

> The question mentioned is as it is... I just copy-pasted it here.
> @saurabh thanks for the explanation of the cube problem, I guess that is an
> appropriate solution for the question.
> And for the other question, on detection of typos and suggestions, I would
> like to know what 'k' in your explanation stands for. How are the
> values allocated to it? Should it be that, for each wrong word not found in
> the dictionary, we check whether a word exists in the dictionary at edit
> distance 1, and so on until we get the correct word?
>
> On Sat, Oct 27, 2012 at 8:12 AM, Saurabh Kumar <[email protected]> wrote:
>
>> Could you please share the link? Because at first glance a Trie looks like a
>> bad choice for this task.
>>
>> I'd go with the Levenshtein distance and a kd-tree.
>> First, implement the Levenshtein distance algorithm to calculate the edit
>> distance of two strings.
>> Second, since Levenshtein distance defines a metric space, we can use
>> a metric tree like a BK-tree and populate it with our dictionary.
>> Choose a random word from the dictionary as the root and subsequently insert
>> dictionary words (picking them up randomly) into the tree.
>> A node has an arbitrary number of children. The parent-child edge represents
>> the corresponding Levenshtein distance between them.
>>
>> Building the tree is a one-time process. Once the tree is built, we can
>> devise a way to serialize it and store it.
>>
>> Using this tree we can find all the words with edit distance less than or
>> equal to, say, k.
>> Let's define a function in the Tree class as: List KDTreeSearch(s, k);
>> which searches for all strings s' in the tree such that |s-s'| <= k, i.e.
>> all strings which are within an edit distance of k.
>> Searching:
>> Start with the root and calculate the edit distance of s from the root. If
>> it's, say, d, then we know exactly which children we need to descend into in
>> order to find the words with distance <= k (those whose edge label lies in
>> [d-k, d+k], by the triangle inequality).
>>
>> Looking for typos:
>> Scan the document and for each word 'w' make a call:
>>     list = KDTreeSearch(w, 0);
>>     if list.size() == 1      // We have the word in the dictionary.
>>     else list = KDTreeSearch(w, 2);   // all words within edit distance 2 of w
>>
>> The returned 'list' can sometimes be large; we can subsequently filter it
>> by narrowing down our definition of 'typos',
>> e.g. for typo w = REDT [REST is more likely than RENT], or maybe some
>> phoneme model etc. You should discuss this at length with the
>> interviewer.
>>
>> On 27 October 2012 07:03, Raghavan <[email protected]> wrote:
>>
>>> By any chance did you read the new blog post by Gayle Laakmann?
>>>
>>> I guess to detect typos we can use some sort of Trie implementation.
>>>
>>> On Fri, Oct 26, 2012 at 7:50 PM, payal gupta <[email protected]> wrote:
>>>
>>>> Given a cube with side length n, write code to print all possible
>>>> paths from the center to the surface.
>>>> Thanks in advance.
>>>>
>>>> Regards,
>>>> PAYAL GUPTA,
>>>> NIT-B.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Algorithm Geeks" group.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msg/algogeeks/-/ZaItRf_9A_IJ.
>>>>
>>>> To post to this group, send email to [email protected].
>>>> To unsubscribe from this group, send email to
>>>> [email protected].
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/algogeeks?hl=en.
>>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Raghavan KL
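To make the tree described in this thread a bit more concrete, here is a rough Python sketch of a BK-tree over Levenshtein distance. It is only one possible way to write it, not a reference implementation: the names `levenshtein`, `BKTree`, `insert` and `search` are made up for illustration, nodes are plain `(word, children)` tuples, and the fixed shuffle seed is an assumption added so that runs are repeatable.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, children); children maps edge distance -> node."""

    def __init__(self, words, seed=0):
        words = list(dict.fromkeys(words))   # drop duplicates
        random.Random(seed).shuffle(words)   # random root and insertion order
        self.root = None
        for w in words:
            self.insert(w)

    def insert(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return                        # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})       # new edge labelled d
                return
            node = child

    def search(self, s, k):
        """Return sorted (distance, word) pairs with distance <= k."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(s, word)
            if d <= k:
                results.append((d, word))
            # Triangle inequality: only edges in [d - k, d + k] can match.
            for edge, child in children.items():
                if d - k <= edge <= d + k:
                    stack.append(child)
        return sorted(results)
```

For example, with the toy dictionary `["rest", "rent", "red", "read", "bed", "test", "volcano"]`, `BKTree(words).search("redt", 1)` returns `[(1, 'red'), (1, 'rent'), (1, 'rest')]`, and `search("redt", 0)` returns an empty list because "redt" is not in the dictionary.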

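The lookup strategy from this thread (exact check at k = 0, then k = 2, then k = 5, then give up) could be sketched roughly as below. Everything here is an assumption made for illustration: `suggest`, `keyboard_penalty` and `brute_force_search` are hypothetical names, the QWERTY layout is a crude three-row grid, and the brute-force scan merely stands in for whatever `(word, k)` query the real dictionary structure offers. Ranking same-distance candidates by key distance is just one way to express the "REST before RENT" preference; a phoneme model would slot into the same sort key.

```python
from functools import lru_cache

# Crude QWERTY grid: letter -> (row, column). Row stagger is ignored.
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


@lru_cache(maxsize=None)
def dist(a, b):
    """Plain recursive Levenshtein distance; fine for a toy word list."""
    if not a or not b:
        return len(a) + len(b)
    return min(dist(a[1:], b) + 1,
               dist(a, b[1:]) + 1,
               dist(a[1:], b[1:]) + (a[0] != b[0]))


def brute_force_search(words):
    """Stand-in for the tree query: sorted (distance, word) pairs within k."""
    def search(w, k):
        return sorted((dist(w, x), x) for x in words if dist(w, x) <= k)
    return search


def keyboard_penalty(typo, candidate):
    """Sum of key distances over substituted letters (equal lengths only)."""
    if len(typo) != len(candidate):
        return float("inf")   # assumption: rank insert/delete fixes last
    total = 0
    for a, b in zip(typo, candidate):
        if a != b:
            if a not in KEY_POS or b not in KEY_POS:
                return float("inf")
            (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
            total += abs(r1 - r2) + abs(c1 - c2)
    return total


def suggest(word, search, fuzzy_ks=(2, 5), limit=5):
    """Return (is_valid, suggestions); an empty list means "No Suggestion"."""
    w = word.lower()
    if search(w, 0):                      # k = 0: exact match
        return True, []
    for k in fuzzy_ks:                    # at most two fuzzy queries
        hits = search(w, k)
        if hits:
            hits.sort(key=lambda h: (h[0], keyboard_penalty(w, h[1]), h[1]))
            return False, [cand for _, cand in hits[:limit]]
    return False, []
```

With the toy word list `["rest", "rent", "red", "read", "volcano"]`, `suggest("REDT", search)` puts "rest" ahead of "rent" because 's' sits next to 'd' on the home row, while `suggest("ijljhflkjjiulgihh", search)` comes back with an empty suggestion list after the k = 5 query.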