Madhu Reddy wrote:

> Hi,
>    I want to find duplicate records in a large file....
> it contains around 22 million records.....
>
> basically, the following is my file structure....
>
>  C1     C2     C3   C4
> ------------------------
> 12345 efghij klmno pqrs
> 34567 abnerv oiuuy uyrv

If I were going to handle this many records within my program, I would construct a
balanced tree (a B-tree). This structure is custom-built for large data sets, but a bit
complicated.

The tree is composed of nodes carrying arrays with capacities of four or more data
items apiece.
Each node also has an array of child-node references, one more in number than the
array of data items.
Each node has a minimum capacity half that of its maximum capacity. [The absolute root
of the tree is, of course, an exception to this rule.]
Each of the arrays also has one "extra" element beyond its normal maximum capacity, to
be used in the balance-shifting process.
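
To make that concrete, here is a rough sketch of what such a node might look like in
Perl. The hash layout, the new_node() helper, and the $MAX/$MIN constants are just my
own choices for illustration, not anything standardized:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumed constants: each node holds at most $MAX keys and, except
    # for the root, at least $MIN = $MAX / 2 keys.
    my $MAX = 4;
    my $MIN = $MAX / 2;

    # A node is a hash ref: 'keys' is the sorted array of data items,
    # 'kids' the array of child references (one more than the keys on
    # an interior node, empty on a leaf).
    sub new_node {
        my (%args) = @_;
        return { keys => $args{keys} || [], kids => $args{kids} || [] };
    }

    my $root = new_node();   # the tree starts life as one empty leaf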

Here is how it works, in rough terms:

When the first node fills to capacity, the tree:

- Splits its data into two halves.
- Moves half the data to a newly created node.
- Creates a new root node with references to each of the two half-full nodes.  These
  two nodes are now leaf nodes.  (A sketch of this first split follows.)
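
On top of the node layout assumed above, that first split might look like this.
split_root() is my own name for it, and pushing the middle item up into the new root
is the usual B-tree convention, which also matches the split rule described further
down:

    # Sketch of the first split: when the root leaf overflows, carve
    # its keys into two halves and push the middle key up into a
    # brand-new root that points at both halves.
    sub split_root {
        my ($root) = @_;
        my @keys = @{ $root->{keys} };
        my $mid  = int( @keys / 2 );

        my $left  = new_node( keys => [ @keys[ 0 .. $mid - 1 ] ] );
        my $right = new_node( keys => [ @keys[ $mid + 1 .. $#keys ] ] );

        # The new root holds only the middle key plus the two references.
        return new_node( keys => [ $keys[$mid] ], kids => [ $left, $right ] );
    }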

The tree will not grow in height again until all the leaf nodes it can reference are
filled.
The values within each data array are sorted.  On a leaf node, a search would simply
traverse the array.  On an interior node, the search would do a lookahead check to
see if the value of the next data item is above that of the search item.  If it is,
then the search would proceed to the node referenced by the corresponding element of
the pointer [oops, slap me with a wet rutabaga, I said a naughty word!!!] array, and
search there.
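
That lookahead translates into a short recursive routine. Again a sketch against the
node layout assumed above, using string comparison since the example keys are text
fields:

    # Recursive lookup: scan the sorted keys until one is >= the
    # target, then either report the hit, give up at a leaf, or
    # descend into the corresponding child reference.
    sub tree_search {
        my ($node, $target) = @_;
        my $i = 0;
        $i++ while $i < @{ $node->{keys} } && $node->{keys}[$i] lt $target;

        # Found in this node's array?
        return 1 if $i < @{ $node->{keys} } && $node->{keys}[$i] eq $target;

        # A leaf with no match means the key is not in the tree.
        return 0 unless @{ $node->{kids} };

        # Interior node: follow the child pointer at the same index.
        return tree_search( $node->{kids}[$i], $target );
    }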

When a node other than the absolute root exceeds capacity, it first attempts to move
the overflow element onto an adjoining node.  If there is excess capacity on that
node, a simple rotation of data items moves the extra element across.  When there is
no spare capacity on the adjoining node, the node must split, sending its middle data
element and a pointer to the new node upward to be squeezed into the arrays on the
parent node.
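
Here is how that rotate-or-split step might be coded against the same node layout.
fix_overflow() is my own name, and checking only the right-hand sibling is a
simplification; a fuller version would try the left sibling too:

    # Relieve an overflowing child: rotate the spare key into the
    # right sibling if it has room, otherwise split the child and
    # squeeze its middle key into the parent.
    sub fix_overflow {
        my ($parent, $i) = @_;          # $i: index of the overflowing child
        my $child = $parent->{kids}[$i];
        my $sib   = $parent->{kids}[ $i + 1 ];

        if ( $sib && @{ $sib->{keys} } < $MAX ) {
            unshift @{ $sib->{keys} }, $parent->{keys}[$i]; # separator drops down
            $parent->{keys}[$i] = pop @{ $child->{keys} };  # spare key moves up
            unshift @{ $sib->{kids} }, pop @{ $child->{kids} }
                if @{ $child->{kids} };                     # carry a pointer along
            return;
        }

        # No room: split the child, middle key goes up to the parent.
        my @keys = @{ $child->{keys} };
        my $mid  = int( @keys / 2 );
        my $right = new_node(
            keys => [ @keys[ $mid + 1 .. $#keys ] ],
            kids => [ @{ $child->{kids} }[ $mid + 1 .. $#{ $child->{kids} } ] ],
        );
        @{ $child->{keys} } = @keys[ 0 .. $mid - 1 ];
        splice @{ $child->{kids} }, $mid + 1 if @{ $child->{kids} };
        splice @{ $parent->{keys} }, $i, 0, $keys[$mid];
        splice @{ $parent->{kids} }, $i + 1, 0, $right;
    }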

The process of deletion is the converse of insertion: adjacent nodes shift elements
sideways until both are at minimum capacity, then merge when one drops below it.
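
The mirror-image routine might look like this. As with the overflow case, only the
right sibling is handled, so treat it as a sketch rather than a complete
implementation:

    # Converse of fix_overflow: borrow from the right sibling if it
    # can spare a key, otherwise merge child, separator, and sibling.
    # (Assumes a right sibling exists; the last child would need the
    # mirror-image logic against its left sibling.)
    sub fix_underflow {
        my ($parent, $i) = @_;          # $i: index of the starved child
        my $child = $parent->{kids}[$i];
        my $sib   = $parent->{kids}[ $i + 1 ];

        if ( @{ $sib->{keys} } > $MIN ) {
            # Sibling has a spare: rotate it left through the separator.
            push @{ $child->{keys} }, $parent->{keys}[$i];
            $parent->{keys}[$i] = shift @{ $sib->{keys} };
            push @{ $child->{kids} }, shift @{ $sib->{kids} }
                if @{ $sib->{kids} };
            return;
        }

        # Both at minimum: fold everything into one node.
        push @{ $child->{keys} }, $parent->{keys}[$i], @{ $sib->{keys} };
        push @{ $child->{kids} }, @{ $sib->{kids} };
        splice @{ $parent->{keys} }, $i, 1;       # separator leaves the parent
        splice @{ $parent->{kids} }, $i + 1, 1;   # sibling node goes away
    }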

It's a devilishly hard structure to build, but once you have the details worked out,
it totally flies.  It uses the full power of exponential expansion without the
probabilistic distortions that can come with more primitive trees if data is inserted
in the wrong patterns (feed already-sorted data into a plain binary tree, for
instance, and it degenerates into a linked list).

For gigabyte-sized data stores, you can't beat a balanced tree.
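
Tying it back to the original question, duplicate detection would just be a search
before every insert. The tree_insert() below is assumed, not shown: it would wire the
earlier sketches together (descend to a leaf, add the key, fix any overflow on the
way back up):

    # Hypothetical driver: report a record as a duplicate when its
    # key is already in the tree, otherwise insert it.
    while ( my $line = <STDIN> ) {
        chomp $line;
        my ($key) = split ' ', $line;       # C1 is the record key
        if ( tree_search( $root, $key ) ) {
            print "duplicate: $line\n";
        }
        else {
            $root = tree_insert( $root, $key );   # assumed, not shown
        }
    }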

Anybody out there build one in Perl?

Joseph

