On Wednesday 24/1/2007 23:05, Sick Monkey wrote:
I am trying to write a Python script that will compare 2 files that
contain names (millions of them).
More specifically, I have 2 files (Files1.txt and
Files2.txt). Files1.txt contains 180 thousand names and Files2.txt
contains 34 million names.
I have a script that analyzes these two files and stores the names
in 2 different lists (fileList1 and fileList2 respectively). I
have imported the difflib library, and after the lists are created
I use difflib to match on similarity (keeping just the names
that are similar between the two files).
This works perfectly for hundreds of names but takes forever for
millions of them, so it is not really efficient.
Does anyone have any idea how to make this more
efficient (in terms of both time and RAM)?
Any advice would be greatly appreciated. (NOTE: I have been
trying to study multithreading, but have not really grasped the
concept, so I may need some examples.)
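
For concreteness, here is a minimal sketch of the kind of difflib
matching described above; get_close_matches and the 0.9 cutoff are
assumptions, since the actual script is not shown in the post:

import difflib

with open('Files1.txt') as f1:
    fileList1 = [line.strip() for line in f1]
with open('Files2.txt') as f2:
    fileList2 = [line.strip() for line in f2]

matches = []
for name in fileList1:
    # Each call scans all of fileList2 with SequenceMatcher, so the
    # total work is roughly 180,000 x 34,000,000 comparisons.
    close = difflib.get_close_matches(name, fileList2, n=1, cutoff=0.9)
    if close:
        matches.append((name, close[0]))

That quadratic all-pairs scan, not Python itself, is why it takes
forever.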
When you say "names" you mean people's names? So you want to match,
say, Levenshtein Vladimir to Lebenstain V.? And you only have the
names to match?
Not a good candidate for difflib. See this paper:
Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital
Government Web Service (2003)
Mohamed G. Elfeky, Vassilios S. Verykios, Ahmed K. Elmagarmid, Thanaa
M. Ghanem, Ahmed R. Huwait.
http://citeseer.ist.psu.edu/elfeky03record.html
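
The central trick in that literature is blocking: derive a cheap key
for every name (the first letter, a phonetic code such as Soundex,
q-grams) and only run the expensive comparison on pairs that share a
key. A rough sketch, using a deliberately simple first-letter key
rather than anything from the paper's actual toolbox:

from collections import defaultdict

def block_key(name):
    # Hypothetical blocking key: the lowercased first letter.
    # Real systems use phonetic codes (Soundex, NYSIIS) or q-grams.
    return name.strip().lower()[:1]

def candidate_pairs(names1, names2):
    # Index the larger file once, then yield only pairs that share
    # a block; every other pair is skipped without being compared.
    blocks = defaultdict(list)
    for name2 in names2:
        blocks[block_key(name2)].append(name2)
    for name1 in names1:
        for name2 in blocks.get(block_key(name1), ()):
            yield name1, name2

for pair in candidate_pairs(['Levenshtein Vladimir'],
                            ['Lebenstain V.', 'Smith J.']):
    print(pair)  # only the same-initial candidate survives

With a finer key (say, the Soundex code of the surname) the blocks
shrink and the pairwise work drops by orders of magnitude; a name can
also be placed in several blocks to tolerate errors in the key itself.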
You can get other pointers from the Levenshtein distance article
on Wikipedia and from http://en.wikipedia.org/wiki/Fuzzy_string_searching
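
For the per-pair comparison itself, plain Levenshtein distance is
short to write; here is a minimal dynamic-programming sketch (for
these data volumes a C implementation, e.g. the python-Levenshtein
extension module, would be far faster):

def levenshtein(a, b):
    # Classic edit-distance DP, keeping only the previous row:
    # O(len(a) * len(b)) time, O(len(b)) space.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a):
        current = [i + 1]
        for j, cb in enumerate(b):
            current.append(min(previous[j + 1] + 1,        # deletion
                               current[j] + 1,             # insertion
                               previous[j] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

print(levenshtein('Levenshtein Vladimir', 'Lebenstain V.'))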
--
Gabriel Genellina
Softlab SRL