Re: Help required.....about string/text manipulation

Rob Dixon Sun, 15 Jun 2003 08:34:25 -0700

Anybody?

Rob




Chinku Simon wrote:
>
> > Mohit_jain01 wrote:
> > >
> > > > From: Rob Dixon
> > > >
> > > > Mohit_jain01 wrote:
> > > > >
> > > > > I am facing a problem with text file manipulation with Perl.
> > > > >
> > > > > I have a file with over 2 lakh (200,000) lines of data.
> > > > > I need to find the duplicate strings in the file and copy those records
> > > > > into another file.
> > > > >
> > > > > Is there a function or module in Perl with which I can read the duplicates in a
> > > > > file in one go and print them
> > > > > to another file?
> > > >
> > > > Before we can help you we need to know a little more of your problem.
> > > >
> > > > Are you looking for duplicate lines in the file, or duplicate strings defined
> > > > in some other way? How big is the file you want to read (how many lines
> > > > or strings do you want to compare)?
> > > >
> > > > There are modules which will help you write your program, but exactly
> > > > how you go about it depends on the details of your problem.
> > >
> > > I have a big file containing about 200000 lines. This file basically contains 
> > > some records. A sample of the file is as given
> > > below:
> > >
> > > dn: cn=1148734,ou=Employees,dc=jci,dc=com
> > >
> > > displayname: Herek, Moriah L
> > >
> > > jdirlastfourssn: 2888
> > >
> > > dn: cn=1148735,ou=Employees,dc=jci,dc=com
> > >
> > > displayname: Pelletier, Michael J
> > >
> > > jdirlastfourssn: 8719
> > >
> > > uid: cpellem
> > >
> > [snip data]
> > >
> > > What I need to do is:
> > >
> > > 1. Take the first entry and get the value of the display name and 
> > > jdirlastfourssn attribute.
> > >
> > > 2. Check whether there is another record with the same display name attribute 
> > > value.
> > >        (There could be multiple records)
> > >
> > >  3. If so then extract both record and write them into another file.
> > >
> > >  4. Delete these duplicate records from the parent file.
> > >
> > > 5. Do that for all records.
> > >
> >
> > I'm not clear whether you mean 200K lines or 200K records (which seem to be mostly
> > 6 lines each, except for 'Pelletier, Michael J', which has an additional pair of
> > lines for a 'uid' value). But even if it were 200K records at about 100 characters
> > each this would be 20MB, which is well within the capacity of all but the smallest
> > computers these days. This problem is far easier with all the data in memory, so
> > I'll go that way, and if you find it's not working or is too slow we'll think 
> > again.
> >
> > OK, so let me rewrite your algorithm a little.
> >
> > - Read all records into memory
> > - While there are records left
> > -   calculate a 'key' from the display name and serial number
> > -   find all records in the data with a matching key
> > -   if there was only one then print it to parent file, else print them to file 2
> > -   delete them from the list
> >
> > From the top:
> >
> >
> > "Read in all of the records"
> >
> > It looks like all the information starts with a line beginning with 'dn:'. If this
> > is wrong we'll have to change it.
> >
> >   my @records;
> >
> >   while (<DATA>) {
> >     push @records, '' if /^dn:\s+/;
> >     $records[-1] .= $_;
> >   }
> >
> >
> > Before the loop, how about a subroutine which, given one of the multi-line records,
> > will return a key value containing the name and serial number. This picks out the
> > strings and concatenates them with a tab character in between, chosen because it
> > is unlikely to appear in the data itself.
> >
> >   sub keyval {
> >     my $rec = shift;
> >     my ($name, $sn);
> >     ($name) = $rec =~ m/^displayname:\s+(.+)/m;
> >     ($sn) = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
> >     join "\t", $name, $sn;
> >   }
> >
> >
> > "While there are records left" Here's the loop, including a couple of lines to
> > remove all the empty (non-existent) entries from the beginning, which will be
> > left by the call to 'delete' that you see in a moment.
> >
> >   while (@records) {
> >     until (exists $records[0]) {
> >       shift @records;
> >     }
> >     :
> >   }
> >
> >
> > "Calculate a 'key' from the display name and serial number"
> >
> >   my $key = keyval($records[0]);
> >
> > "Find all records in the data with a matching key" This call to 'map'
> > returns a list of indices of all the records in the array which have
> > a matching key value. This is bound to include the index zero as the
> > first record matches itself. If there are more, then the length of
> > @slice will be more than 1.
> >
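> >   my $i = 0;
> >   my @slice = map {
> >     my @i = $i++;
> >     # '&&' rather than 'and': 'and' binds looser than '?:', so a deleted
> >     # (undefined) slot would put a false value into the list
> >     defined $_ && keyval($_) eq $key ? @i : ()
> >   } @records;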
> >
> >
> > The next two steps together: "If there was only one then print it to
> > parent file, else print them to file 2. Delete them from the list"
> > The 'delete' function usefully returns a list of all the records
> > it deleted, so we can just print the results of deleting the array
> > slice.
> >
> >   if (@slice == 1) {
> >     print PARENT delete @records[@slice];
> >   }
> >   else {
> >     print FILE2 delete @records[@slice];
> >   }
> >
> >
> > And you're done. Clearly you need to open filehandles DATA for read
> > and PARENT and FILE2 for write, but the program's there otherwise.
> >
> > Let us know if you need help assembling the kit.
> >
> Hi,
>
> I would like some help in assembling the kit.
>
> Thanks in Advance
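
For what it's worth, here is one way the pieces quoted above might fit together. Treat it as a sketch under assumptions rather than a finished program: the wrapper name split_duplicates and its three filehandle arguments are made up for illustration, it uses lexical filehandles instead of the bareword DATA, PARENT and FILE2, and it finds matching indices with 'grep' over 0 .. $#records instead of the 'map' with a counter. It still assumes every record starts with a 'dn:' line.

```perl
use strict;
use warnings;

# Build a comparison key from one multi-line record: the display name and
# the jdirlastfourssn value joined by a tab, as in the keyval sub above.
# A missing field becomes an empty string so that 'join' does not warn.
sub keyval {
    my $rec = shift;
    my ($name) = $rec =~ m/^displayname:\s+(.+)/m;
    my ($sn)   = $rec =~ m/^jdirlastfourssn:\s+(\d+)/m;
    return join "\t", map { defined $_ ? $_ : '' } $name, $sn;
}

# Hypothetical wrapper (not from the thread): reads records from $in,
# prints records with a unique key to $parent and records that share a
# key with another record to $dups.
sub split_duplicates {
    my ($in, $parent, $dups) = @_;

    # Read all records into memory; a new record starts at each 'dn:' line.
    my @records;
    while (my $line = <$in>) {
        push @records, '' if $line =~ /^dn:\s+/ or !@records;
        $records[-1] .= $line;
    }

    while (@records) {
        # Drop leading slots emptied by an earlier 'delete'.
        shift @records until !@records or exists $records[0];
        last unless @records;

        my $key = keyval($records[0]);

        # Indices of every remaining record with the same key.
        # Index 0 always matches itself.
        my @slice = grep {
            defined $records[$_] && keyval($records[$_]) eq $key
        } 0 .. $#records;

        # 'delete' on an array slice returns the deleted records,
        # so they can be printed directly.
        my $out = @slice == 1 ? $parent : $dups;
        print {$out} delete @records[@slice];
    }
    return;
}
```

To run it on real files, open one filehandle for reading and two for writing and pass them in, in that order. Because the "parent" output is written fresh, the duplicates are effectively removed from it without editing the original file in place. Each pass rescans all remaining records, so on 200,000 records this may turn out to be slow; if it does, grouping the records in a hash keyed on keyval($record) would be the usual next thing to try.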



