Dear Margi,

Great question and thanks for posting this to the list! :)

You may also want to split your extracted text not just on "\n" but
also on whitespace (" ") to canonicalize the words. You might even
consider an approach for reconstructing words (recall the method we
discussed in class using N-grams). This should make your process of
using commons-lang and edit distance easier. For the OCR help, check
out TIKA-93 [1] and the work going on there.
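
To make the idea concrete, here is a minimal sketch of the
tokenize-then-compare approach. The keyword list, sample text, and
threshold are placeholders, and the plain dynamic-programming Levenshtein
method below just stands in for commons-lang's
StringUtils.getLevenshteinDistance so the example has no external
dependencies:

```java
import java.util.*;

public class FuzzyKeywordMatch {

    // Classic dynamic-programming Levenshtein distance (two rolling rows).
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }

    // Canonicalize: split on any whitespace (newlines included),
    // lowercase, and strip non-alphanumeric OCR noise.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            String w = tok.toLowerCase().replaceAll("[^a-z0-9]", "");
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    // Report keywords that fall within `threshold` edits of some token.
    static Set<String> matches(String text, String[] keywords, int threshold) {
        Set<String> hits = new TreeSet<>();
        for (String word : tokenize(text)) {
            for (String kw : keywords) {
                if (levenshtein(word, kw) <= threshold) hits.add(kw);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String ocrText = "Extracti0n\nof polar\ndata sets";  // hypothetical OCR output
        String[] keywords = {"extraction", "polar"};         // placeholder keywords
        System.out.println(matches(ocrText, keywords, 1));   // prints [extraction, polar]
    }
}
```

In real code you would replace the levenshtein method with
StringUtils.getLevenshteinDistance from commons-lang3, which also has a
threshold-bounded variant that returns -1 early once the distance exceeds
your cutoff.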

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-93

-----Original Message-----
From: Margi Patel <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, March 16, 2014 11:36 AM
To: "[email protected]" <[email protected]>
Subject: Use of Levenshtein distance to find similar words

>Hello Professor Mattmann,
>
>I have completed the basic requirements of the TIKA assignment (without
>OCR quality check) and now I want to go for the extra credit part. I plan
>to use the Levenshtein distance implemented in Apache's
>commons-lang3-3.1.jar file.
>
>I tried the following:
>---------------------------
>After I extract all of the text from each PDF file, I need to find the
>Levenshtein distance between each of the keywords in my set of 11
>keywords and the extracted text.
>Since the extracted text is one very long string, I thought I would
>split it on the newline character ("\n"). For each line, I compute the
>edit distance, keeping the threshold very low.
>
>However, this does not seem to be the correct approach, since the
>extracted text contains a good amount of junk characters due to OCR
>noise and errors. I need to do some pre-processing on the extracted
>text first.
>
>Pointers in the right direction would be greatly appreciated.
>
>Thanks !
>-Margi