On Wed, Feb 4, 2009 at 5:20 PM, Nick Matzke <mat...@berkeley.edu> wrote:
> Hi all,
>
> So I have an interesting challenge. I want to compare two book chapters,
> which I have in plain text format, and find out (a) percentage similarity
> and (b) what has changed.
>
> Some features make this problem different than what seems to be the standard
> text-matching problem solvable with e.g. difflib. Here is what I mean:
>
> * there is no guarantee that single lines from each file will be directly
> comparable -- e.g., if a few words are inserted into a sentence, then a
> chunk of the sentence will be moved to the next line, then a chunk of that
> line moved to the next, etc.
>
> * Also, there are cases where paragraphs have been moved around, sections
> re-ordered, etc. So it can't just be a "linear" match.
>
> I imagine this kind of thing can't be all that hard in the grand scheme of
> things, but I couldn't find an easily applicable solution readily available.
> I have advanced beginner python skills but am not quite where I could do
> this kind of thing from scratch without some guidance about the likely
> functions, libraries etc. to use.
>
> PS: I am going to have to do this for multiple book chapters so various
> software packages, e.g. for windows, are not really usable.

Though not written in Python, wdiff (http://www.gnu.org/software/wdiff/wdiff.html)
might be a good starting point. It compares files word by word rather than
line by line, so text that has merely been re-wrapped does not show up as a
change.

Cheers,
Chris
-- 
Follow the path of the Iguana...
http://rebertia.com
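
For a Python-side counterpart to the wdiff suggestion, here is a minimal sketch that applies the standard library's difflib to words instead of lines. It is not wdiff and not a tested solution for real book chapters. The function names (words, similarity, word_changes, match_paragraphs), the blank-line paragraph split, and the 0.6 cutoff are all assumptions made for illustration.

```python
import difflib


def words(text):
    # Split on any whitespace so that line re-wrapping does not matter.
    return text.split()


def similarity(old, new):
    """Return a 0..1 similarity ratio computed over words, not lines."""
    matcher = difflib.SequenceMatcher(None, words(old), words(new), autojunk=False)
    return matcher.ratio()


def word_changes(old, new):
    """List (tag, old_words, new_words) for every changed run of words."""
    a, b = words(old), words(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def match_paragraphs(old, new, cutoff=0.6):
    """Pair each old paragraph with its most similar new one, in any order.

    Assumes paragraphs are separated by blank lines; the cutoff is arbitrary.
    """
    new_paras = [p for p in new.split("\n\n") if p.strip()]
    pairs = []
    for para in (p for p in old.split("\n\n") if p.strip()):
        scored = [(similarity(para, cand), cand) for cand in new_paras]
        score, best = max(scored, key=lambda s: s[0], default=(0.0, None))
        pairs.append((para, best if score >= cutoff else None, score))
    return pairs
```

Calling similarity(old_text, new_text) gives a rough answer to (a), and word_changes gives (b) for text that stays in order. match_paragraphs is a brute-force attempt at the re-ordering case: it compares every old paragraph against every new one, which is quadratic but probably tolerable for chapter-sized input.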