On Wednesday 24/1/2007 23:05, Sick Monkey wrote:
I am trying to write a Python script that will compare 2 files that
contain names (millions of them).
More specifically, I have 2 files (Files1.txt and
Files2.txt). Files1.txt contains 180 thousand names and Files2.txt
contains 34 million names.
I have a script that analyzes these two files and stores the names
in 2 different lists (fileList1 and fileList2 respectively). I
have imported the difflib library, and after the lists are created
I use difflib to match on similarity (keeping just the names
that are similar between the two files).
This works perfectly for hundreds of names but takes forever for
millions of them, so it is not really efficient.
Does anyone have any idea how to make this more
efficient (in terms of both time and RAM)?
Any advice would be greatly appreciated. (NOTE: I have been
trying to study multithreading, but have not really grasped the
concept, so I may need some examples.)
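
For concreteness, here is a minimal sketch of the kind of difflib
matching described above; get_close_matches and the 0.9 cutoff are
assumptions, since the actual script is not shown in the post:

import difflib

with open('Files1.txt') as f1:
    fileList1 = [line.strip() for line in f1]
with open('Files2.txt') as f2:
    fileList2 = [line.strip() for line in f2]

matches = []
for name in fileList1:
    # Each call scans all of fileList2 with SequenceMatcher, so the
    # total work is roughly 180,000 x 34,000,000 comparisons.
    close = difflib.get_close_matches(name, fileList2, n=1, cutoff=0.9)
    if close:
        matches.append((name, close[0]))

That quadratic all-pairs scan, not Python itself, is why it takes
forever.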
When you say "names" you mean people's names? So you want to match,
say, Levenshtein Vladimir to Lebenstain V.? And you only have the
names to match?
Not a good candidate for difflib. See this paper:
Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital
Government Web Service (2003)
Mohamed G. Elfeky, Vassilios S. Verykios, Ahmed K. Elmagarmid, Thanaa
M. Ghanem, Ahmed R. Huwait.
http://citeseer.ist.psu.edu/elfeky03record.html
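
The central trick in that literature is blocking: derive a cheap key
for every name (the first letter, a phonetic code such as Soundex,
q-grams) and only run the expensive comparison on pairs that share a
key. A rough sketch, using a deliberately simple first-letter key
rather than anything from the paper's actual toolbox:

from collections import defaultdict

def block_key(name):
    # Hypothetical blocking key: the lowercased first letter.
    # Real systems use phonetic codes (Soundex, NYSIIS) or q-grams.
    return name.strip().lower()[:1]

def candidate_pairs(names1, names2):
    # Index the larger file once, then yield only pairs that share
    # a block; every other pair is skipped without being compared.
    blocks = defaultdict(list)
    for name2 in names2:
        blocks[block_key(name2)].append(name2)
    for name1 in names1:
        for name2 in blocks.get(block_key(name1), ()):
            yield name1, name2

for pair in candidate_pairs(['Levenshtein Vladimir'],
                            ['Lebenstain V.', 'Smith J.']):
    print(pair)  # only the same-initial candidate survives

With a finer key (say, the Soundex code of the surname) the blocks
shrink and the pairwise work drops by orders of magnitude; a name can
also be placed in several blocks to tolerate errors in the key itself.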
You can get other pointers from the Levenshtein distance article
on Wikipedia and from http://en.wikipedia.org/wiki/Fuzzy_string_searching
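
For the per-pair comparison itself, plain Levenshtein distance is
short to write; here is a minimal dynamic-programming sketch (for
these data volumes a C implementation, e.g. the python-Levenshtein
extension module, would be far faster):

def levenshtein(a, b):
    # Classic edit-distance DP, keeping only the previous row:
    # O(len(a) * len(b)) time, O(len(b)) space.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a):
        current = [i + 1]
        for j, cb in enumerate(b):
            current.append(min(previous[j + 1] + 1,        # deletion
                               current[j] + 1,             # insertion
                               previous[j] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

print(levenshtein('Levenshtein Vladimir', 'Lebenstain V.'))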
--
Gabriel Genellina
Softlab SRL