I think you want to index the table. Right now you are searching an unindexed table for every word, and that overhead will be far greater than the cost of adding an entry to an index. Inserting an index entry does require the database to work out where the entry belongs, but it is looking in a sorted structure, so it can tell fairly quickly whether the entry already exists. A search of the unsorted table has to scan every entry, 340,000 and counting, whereas a binary search needs only about 19 comparisons (log2 N, since 2^19 = 524,288 is the first power of two above 340,000). Sure, there will still be some overhead on each insert and the occasional re-indexing, but it is small by comparison.
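
To make that concrete, here is a minimal sketch of the indexed approach. It is written in Python with SQLite standing in for whatever database you are actually using, and the table name `words`, the index name `idx_words_word`, and the helper `load_unique_words` are all made up for illustration. It goes one step further than a plain index by making the index UNIQUE, so the database does the "is it already there?" check itself and the script never issues a separate SELECT per word.

```python
import sqlite3


def load_unique_words(conn, lines):
    """Insert every whitespace-separated word at most once.

    Returns the number of rows actually added. Table and index names
    are hypothetical; SQLite is an assumption, not the original setup.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT NOT NULL)")
    # The unique index is what turns the existence check into a fast
    # lookup in a sorted structure instead of a scan of every row.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_words_word ON words (word)"
    )
    before = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]

    # One transaction for the whole batch; committing per word would
    # add a lot of avoidable overhead.
    with conn:
        conn.executemany(
            # OR IGNORE silently skips words that violate the unique index.
            "INSERT OR IGNORE INTO words (word) VALUES (?)",
            ((word,) for line in lines for word in line.split()),
        )

    after = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
    return after - before


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    added = load_unique_words(connection, ["the cat sat", "on the mat"])
    print(added, "new words")
```

Because `lines` can be any iterable, an open file object can be passed straight in, so the 895 MB file is read line by line and never has to fit in RAM. If the real database is MySQL, `INSERT IGNORE` against a unique index should play the same role, though I have not tested that against your setup.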


Evan Morris wrote:

This is more a process question that is not specific to PHP, but maybe
someone can help me.

I have a text file containing strings. The text file is pretty massive,
about 895 MB. I need to load the words in the text file into the database in
such a way that there is a single occurrence of each word in the table. What
this means is the script has to read each word in the file, check if the
word is in the table, and if not, add it. Obviously as the table grows, the
rate of execution drops because the number of checks that have to be made
increases as more words are added to the table.

Does anyone know of an efficient way to do this? There are over a million
lines in the file, and my script is at about 340 000 having run overnight. I
need to get the data into the table much faster than that.

I considered hashing the file in memory (ie, doing the uniqueness checking
in an associative array) but am not sure how much RAM that would consume and
I don't want to crash the server.

I have also not indexed the table because my gut feel was that the overhead
of constantly updating an indexed table would cancel out the gain of
checking an indexed table. Is that true?

Any and all help, comments, etc, appreciated as always.


