Madhu Reddy wrote:
> Hi,
> I want to find duplicate records in a large file....
> it contains around 22 million records.....
>
> basically the following is my file structure....
>
> C1     C2     C3     C4
> ------------------------
> 12345  efghij klmno  pqrs
> 34567  abnerv oiuuy  uyrv
If I were going to handle this many records within my program, I would construct a balanced tree (a B-tree). This structure is custom-built for large data sets, but a bit complicated.

The tree is composed of nodes carrying arrays with capacities of four or more data items apiece. Each node also has an array of node references, one more in number than the array of data items. Each node has a minimum capacity half that of its maximum capacity. [The absolute root of the tree is, of course, an exception to this rule.] Each of the arrays also has one "extra" element beyond its normal maximum capacity, to be used in the balance-shifting process.

Here is how it works, in rough terms. When the first node fills to capacity, the tree:

  - splits its data into two halves,
  - moves half the data to a newly created node, and
  - creates a new root node with references to each of the two half-full nodes.

These two nodes are now leaf nodes. The tree will not grow in height again until all the leaf nodes it can reference are filled.

The values within each data array are sorted. On a leaf node, a search simply traverses the array. On an interior node, the search does a lookahead check to see whether the value of the next data item is above that of the search item. If it is, the search proceeds to the node referenced by the corresponding element of the pointer [oops, slap me with a wet rutabaga, I said a naughty word!!!] array, and continues there.

When a node other than the absolute root exceeds capacity, it first attempts to move the overflow element onto an adjoining sibling. If there is excess capacity on that node, a simple rotation of data items moves the extra element across. When there is no spare capacity on the adjoining node, the node must split, sending its middle data element and the pointer to the new node upward to be squeezed into the arrays on the parent node. The process of deletion is the converse of insertion, with nodes shifting elements sideways until both siblings are at their minimum, then merging when one drops below it.

It's a devilishly hard structure to build, but when you have the details worked out, it totally flies. It uses the full power of exponential fan-out without the probabilistic distortions that can come with more primitive trees when data arrives in unlucky patterns (feed a plain binary tree pre-sorted data and it degenerates into a linked list). For gigabyte-sized data stores, you can't beat a balanced tree.

Anybody out there build one in Perl?

Joseph
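P.S. To make the mechanics concrete, here is a bare-bones sketch of the search-and-insert half of the structure in Perl. Everything in it is my own invention for illustration ($ORDER, new_node, btree_insert and friends are not from any module), the order is fixed at 4, and it leaves out deletion and the sideways rotation described above: an overflowing node simply splits. Treat it as a starting point, not a finished implementation.

#!/usr/bin/perl
use strict;
use warnings;

# Toy B-tree: order 4, so a node holds at most 3 keys and 4 children.
# A node is a hashref: { keys => [...], kids => [...] }; leaves have no kids.
my $ORDER = 4;

sub new_node { return { keys => [], kids => [] } }

# Walk down the tree using the lookahead check described above.
sub btree_search {
    my ($node, $key) = @_;
    my $i = 0;
    $i++ while $i < @{ $node->{keys} } && $key > $node->{keys}[$i];
    return 1 if $i < @{ $node->{keys} } && $key == $node->{keys}[$i];
    return 0 unless @{ $node->{kids} };    # leaf, and not found
    return btree_search( $node->{kids}[$i], $key );
}

# Insert a key; returns nothing, or ($middle_key, $new_right_node)
# when this node overflowed and split, for the parent to absorb.
sub btree_insert {
    my ($node, $key) = @_;
    my $i = 0;
    $i++ while $i < @{ $node->{keys} } && $key > $node->{keys}[$i];
    if ( @{ $node->{kids} } ) {            # interior node: recurse
        my ($mid, $right) = btree_insert( $node->{kids}[$i], $key );
        return unless defined $mid;        # child did not split
        splice @{ $node->{keys} }, $i,     0, $mid;    # absorb promoted key
        splice @{ $node->{kids} }, $i + 1, 0, $right;  # and its new sibling
    }
    else {                                 # leaf: drop the key in place
        splice @{ $node->{keys} }, $i, 0, $key;
    }
    return if @{ $node->{keys} } < $ORDER; # still within capacity

    # Overflow: promote the middle key, hand the upper half to a new node.
    my $h     = int( @{ $node->{keys} } / 2 );
    my $mid   = $node->{keys}[$h];
    my $right = {
        keys => [ splice @{ $node->{keys} }, $h + 1 ],
        kids => @{ $node->{kids} } ? [ splice @{ $node->{kids} }, $h + 1 ] : [],
    };
    pop @{ $node->{keys} };                # the promoted key leaves this node
    return ( $mid, $right );
}

# Demo: count duplicates among random keys, growing a new root on split.
my $root = new_node();
my $dups = 0;
for my $key ( map { int rand 1000 } 1 .. 2000 ) {
    if ( btree_search( $root, $key ) ) { $dups++; next }
    my ($mid, $right) = btree_insert( $root, $key );
    $root = { keys => [$mid], kids => [ $root, $right ] } if defined $mid;
}
print "$dups duplicate keys seen\n";

Using hashrefs keeps the sketch short; a serious version would pack each node into a fixed-size disk block, which is where a B-tree really earns its keep on data too big for memory.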