At Thursday 25/1/2007 21:49, Larry Bates wrote:
Gabriel Genellina wrote:
> At Wednesday 24/1/2007 23:05, Sick Monkey wrote:
>
>> I am trying to write a python script that will compare 2 files which
>> contains names (millions of them).
>>
>> More specifically, I have 2 files (Files1.txt and Files2.txt).
>> Files1.txt contains 180 thousand names and Files2.txt contains 34
>> million names.

Put the big list of names in a database and create soundex keys for the names
and make the soundex keys an index so you can search quickly.  Databases
are really good at storing data that is searchable via an index. If you REALLY
need speed you can consider an in-memory database.

Create soundex keys for each name in your small list and query the database
with this key into the table in the DB that is indexed on soundex keys.
If you get a hit, the key is sufficiently "alike" to be a candidate.  I'll
leave the remainder to you.  Perhaps there is other information that will
help determine if there is a match?

Soundex is only good for English words, and it's almost useless for non-English names, so it must be used with caution if used at all.


--
Gabriel Genellina
Softlab SRL

        

        
                
__________________________________________________ Preguntá. Respondé. Descubrí. Todo lo que querías saber, y lo que ni imaginabas, está en Yahoo! Respuestas (Beta). ¡Probalo ya! http://www.yahoo.com.ar/respuestas
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to