Dear Margi,

Great question and thanks for posting this to the list! :)
You may also want to split your extracted text not just by "\n" but also
to split by perhaps " " to canonicalize the words. You may even think of
an approach for creating words (recall we discussed a method in class for
considering N-grams). This should make your process for using commons-lang
and edit distance easier.

For the OCR help, check out TIKA-93 [1] and the work going on there.

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-93

-----Original Message-----
From: Margi Patel <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, March 16, 2014 11:36 AM
To: "[email protected]" <[email protected]>
Subject: Use of Levenshtein distance to find similar words

>Hello Professor Mattmann,
>
>I have completed the basic requirements of the TIKA assignment (without
>the OCR quality check) and now I want to go for the extra credit part. I
>plan to use the Levenshtein distance implementation in Apache's
>commons-lang3-3.1.jar file.
>
>I tried the following:
>---------------------------
>After I extract all of the text from each PDF file, I need to find the
>Levenshtein distance between each of the keywords in my set of 11
>keywords and the extracted text.
>Since the extracted text is a very long string, I thought to split this
>text on the newline character ("\n"). For each line, I compute the edit
>distance, keeping the threshold very low.
>
>However, this does not seem to be the correct approach, since the
>extracted text contains a good amount of junk characters due to OCR noise
>and errors.
>I need to do some pre-processing on the extracted text first.
>
>Pointers in the right direction will be greatly appreciated.
>
>Thanks!
>-Margi
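P.S. In case it helps, here is a self-contained sketch of the
tokenize-then-compare loop described above. To keep it runnable without
any jars I inlined a plain dynamic-programming edit distance; in your own
code you would call StringUtils.getLevenshteinDistance from commons-lang3
instead. The keywords, sample text, and threshold below are just
placeholders, not from your assignment:

```java
import java.util.ArrayList;
import java.util.List;

public class KeywordMatcher {

    // Classic dynamic-programming edit distance; commons-lang3's
    // StringUtils.getLevenshteinDistance computes the same value.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Split extracted text on ALL whitespace (not just "\n"), lower-case,
    // and strip punctuation to canonicalize the words. Digits are kept
    // because OCR often confuses characters like '1' and 'l'.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            String w = tok.toLowerCase().replaceAll("[^a-z0-9]", "");
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    // Return every token within `threshold` edits of some keyword.
    static List<String> nearMatches(String text, String[] keywords,
                                    int threshold) {
        List<String> hits = new ArrayList<>();
        for (String word : tokenize(text)) {
            for (String kw : keywords) {
                if (levenshtein(word, kw) <= threshold) {
                    hits.add(word);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical keywords and simulated OCR noise ('1' for 'i'/'l').
        String[] keywords = {"radiation", "telemetry"};
        String ocrText = "Rad1ation levels\nte1emetry data";
        System.out.println(nearMatches(ocrText, keywords, 1));
    }
}
```

Comparing per word rather than per line keeps the threshold meaningful: a
low threshold on a whole line almost never matches, while on single
canonicalized words it tolerates exactly the small OCR substitutions you
are seeing.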
