Anybody?

Rob

Chinku Simon wrote:
>
> > Mohit_jain01 wrote:
> > >
> > > > From: Rob Dixon
> > > >
> > > > Mohit_jain01 wrote:
> > > > >
> > > > > I am facing a problem with text file manipulation with Perl.
> > > > >
> > > > > I have a file with over 2 lac lines of data.
> > > > > I need to find the duplicates (strings) in the file and copy those records
> > > > > into another file.
> > > > >
> > > > > Is there a function/module in Perl by which I can read the duplicates in a
> > > > > file at one go and print them
> > > > > on to another file.
> > > >
> > > > Before we can help you we need to know a little more of your problem.
> > > >
> > > > Are you looking for duplicate lines in the file, or duplicate strings defined
> > > > in some other way? How big is the file you want to read (how many lines
> > > > or strings do you want to compare)?
> > > >
> > > > There are modules which will help you write your program, but exactly
> > > > how you go about it depends on the details of your problem.
> > >
> > > I have a big file containing about 200000 lines. This file basically contains
> > > some records. A sample of the file is as given
> > > below:
> > >
> > > dn: cn=1148734,ou=Employees,dc=jci,dc=com
> > >
> > > displayname: Herek, Moriah L
> > >
> > > jdirlastfourssn: 2888
> > >
> > > dn: cn=1148735,ou=Employees,dc=jci,dc=com
> > >
> > > displayname: Pelletier, Michael J
> > >
> > > jdirlastfourssn: 8719
> > >
> > > uid: cpellem
> > >
> > [snip data]
> > >
> > > What I need to do is:
> > >
> > > 1. Take the first entry and get the value of the display name and
> > > jdirlastfourssn attribute.
> > >
> > > 2. Check whether there is another record with the same display name attribute
> > > value.
> > > (There could be multiple records)
> > >
> > > 3. If so then extract both records and write them into another file.
> > >
> > > 4. Delete these duplicate records from the parent file.
> > >
> > > 5. Do that for all records.
> > >
> >
> > I'm not clear whether you mean 200K lines or 200K records (which seem to be mostly
> > 6 lines each, except for 'Pelletier, Michael J' which has an additional pair of
> > lines for a 'uid' value). But even if it were 200K records at about 100 characters
> > each this would be 20MB, which is well within the capacity of all but the smallest
> > computers these days. This problem is far easier with all the data in memory, so
> > I'll go that way, and if you find it's not working or is too slow we'll think
> > again.
> >
> > OK, so let me rewrite your algorithm a little.
> >
> > - Read all records into memory
> > - While there are records left
> >   - calculate a 'key' from the display name and serial number
> >   - find all records in the data with a matching key
> >   - if there was only one then print it to parent file, else print them to file 2
> >   - delete them from the list
> >
> > From the top:
> >
> >
> > "Read in all of the records"
> >
> > It looks like all the information starts with a line beginning with 'dn:'. If this
> > is wrong we'll have to change it.
> >
> >   my @records;
> >
> >   while (<DATA>) {
> >     push @records, '' if /^dn:\s+/;
> >     $records[-1] .= $_;
> >   }
> >
> >
> > Before the loop, how about a subroutine which, given one of the multi-line records,
> > will return a key value containing the name and serial number. This picks out the
> > strings and concatenates them with a tab character in between, chosen because it
> > is unlikely to appear in the data itself.
> >
> >   sub keyval {
> >     my $rec = shift;
> >     my ($name, $sn);
> >     ($name) = $rec =~ m/^displayname:\s+(.+)/m;
> >     ($sn) = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
> >     join "\t", $name, $sn;
> >   }
> >
> >
> > "While there are records left" Here's the loop, including a couple of lines to
> > remove any deleted (non-existent) entries from the beginning, which will be left
> > by the call to 'delete' that you see in a moment.
> >
> >   while (@records) {
> >     until (exists $records[0]) {
> >       shift @records;
> >     }
> >     :
> >   }
> >
> >
> > "Calculate a 'key' from the display name and serial number"
> >
> >   my $key = keyval($records[0]);
> >
> > "Find all records in the data with a matching key" This call to 'map'
> > returns a list of indices of all the records in the array which have
> > a matching key value. This is bound to include the index zero, as the
> > first record matches itself. If there are more, then the length of the
> > array will be more than 1.
> >
> >   my $i = 0;
> >   my @slice = map {
> >     my @i = $i++;
> >     (defined $_ and keyval($_) eq $key) ? @i : ()
> >   } @records;
> >
> >
> > The next two steps together: "If there was only one then print it to
> > parent file, else print them to file 2. Delete them from the list"
> > The 'delete' function usefully returns a list of all the records
> > it deleted, so we can just print the results of deleting the array
> > slice.
> >
> >   if (@slice == 1) {
> >     print PARENT delete @records[@slice];
> >   }
> >   else {
> >     print FILE2 delete @records[@slice];
> >   }
> >
> >
> > And you're done. Clearly you need to open filehandles DATA for read
> > and PARENT and FILE2 for write, but the program's there otherwise.
> >
> > Let us know if you need help assembling the kit.
> >
> Hi,
>
> I would like some help in assembling the kit.
>
> Thanks in Advance
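
One way the pieces quoted above might be assembled into a single script is sketched below. It is only a sketch under a few assumptions: the three file names come from the command line (the thread never names them), lexical filehandles stand in for DATA, PARENT and FILE2, and a hash of records grouped by key replaces the repeated 'map' scan and 'delete', so this is a substitution rather than the loop as posted. The subroutine names read_records and partition are made up for illustration; keyval follows the version quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator

# Build the comparison key from one multi-line record: display name and
# last-four-SSN value joined by a tab, as in the quoted keyval.
sub keyval {
    my $rec = shift;
    my ($name) = $rec =~ m/^displayname:\s+(.+)/m;
    my ($sn)   = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
    return join "\t", $name // '', $sn // '';
}

# Read a filehandle into a list of records, each starting at a 'dn:' line.
# Any lines before the first 'dn:' are kept as a record of their own.
sub read_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        push @records, '' if $line =~ /^dn:\s+/ or not @records;
        $records[-1] .= $line;
    }
    return @records;
}

# Group records by key, then return two array refs: records whose key
# occurs once, and records whose key occurs more than once. Groups are
# emitted in the order their key was first seen.
sub partition {
    my @records = @_;
    my (%group, @order);
    for my $rec (@records) {
        my $key = keyval($rec);
        push @order, $key unless exists $group{$key};
        push @{ $group{$key} }, $rec;
    }
    my (@unique, @dupes);
    for my $key (@order) {
        my $members = $group{$key};
        if (@$members == 1) { push @unique, @$members }
        else                { push @dupes,  @$members }
    }
    return (\@unique, \@dupes);
}

# Hypothetical command line: input file, file for unique records,
# file for duplicated records. The input file is never modified.
sub main {
    my ($in, $parent, $file2) = @_;

    open my $src, '<', $in or die "Cannot read $in: $!";
    my @records = read_records($src);
    close $src;

    my ($unique, $dupes) = partition(@records);

    open my $out1, '>', $parent or die "Cannot write $parent: $!";
    print {$out1} @$unique;
    close $out1 or die "Cannot close $parent: $!";

    open my $out2, '>', $file2 or die "Cannot write $file2: $!";
    print {$out2} @$dupes;
    close $out2 or die "Cannot close $file2: $!";
}

main(@ARGV) if @ARGV == 3;
```

Called as, say, perl splitdups.pl input.txt unique.txt dupes.txt (all placeholder names), it leaves the input untouched, writes the records with a unique key to the second file and every record that shares a key to the third. Grouping in a hash means each record is examined once, which should matter with 200K lines; if the key ought to be the display name alone, only keyval needs to change.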