Re: Converting a big flat file

Saravana Kumar Thu, 17 Aug 2006 00:08:28 -0700

Rob Dixon wrote:

> Saravana Kumar wrote:
>  > John W. Krahn wrote:
>  >
>  >>Saravana Kumar wrote:
>  >>
>  >>>I am new to the list and newbie in perl.
>  >>>
>  >>>I have a big flat file (100G). The file was supposed to be a single
>  >>>line, but it is split into many records (as it contains ^M). There are
>  >>>also ^@ and tabs in between.
>  >>>
>  >>>I want to first replace the control characters and tabs with space.
>  >>>
>  >>>I tried this s/[[:cntrl:]\t]/ /g.
>  >>
>  >>The [:cntrl:] character class includes the "\t" character.
>  >>
>  >>>After replacing the above said characters
>  >>>with space i have to insert \n after each 1000th character.
>  >>>
>  >>>But the program hangs after reading about 24G( 1/4th of the file).
>  >>>
>  >>>I thought of reading the file character by character and checking
>  >>>whether the character is ^M, ^@ or \t. If so, replace it with a space
>  >>>and write the output; otherwise simply write the character. I have to
>  >>>keep track of the character count so as to insert \n after each 1000th
>  >>>character.
>  >>>
>  >>>Will the above work or is there any other(simple) way to do this?( or
>  >>>should i just move on to C?)
>  >>>
>  >>>I am not sure why my first program hung (I ran the program on a machine
>  >>>with 2G RAM).
>  >>
>  >>You can do what you want if you set the Input Record Separator to read
>  >>1000 bytes at a time:
>  >>
>  >>$/ = \1000;
>  >>while ( <FILE> ) {
>  >>    s/[[:cntrl:]]/ /g;
>  >>    print "$_\n";
>  >>    }
>  >
>  > Thanks John. That did the trick. I ran the above script with my input
>  > file and redirected the output to another file. Since it creates a new
>  > file, I was wondering whether I can make the changes in the same file,
>  > i.e. read 1000 characters, do the replacement and write the output back
>  > to the same file. This would reduce the disk space used (since the file
>  > I have is 100G).
> 
> That is like preparing an apple pie while it is in the oven to save on
> kitchen space. You can't easily do it because each of your new records is
> one character longer than the original record and you would be overwriting
> data you hadn't processed yet. It is possible, in the sense that you could
> make sure that all the data is read from the file and held elsewhere (in
> memory or in a temporary file) before it is overwritten, but it wouldn't
> be a simple piece of code to get working correctly. In any case it is a
> bad idea because if you have a failure of any sort part-way through
> processing then your original data is then lost and you have no second
> chance. If the people you are working for expect to have files of this
> size and haven't allowed for storage space for several of them at once
> then you need to have a word with them about storage planning. You need a
> new disk drive: $100 will buy you around 300GB these days and that doesn't
> buy enough of your time to write clever software to cope with the lack of
> disk space.
> 
> Cheers,
> 
> Rob
> 
I have enough space on the HDD to store more files; this "idea" came to me
just as a thought. I missed the point that adding "\n" would actually
overwrite the first character of the next record, which I haven't read yet.
I am going ahead with the same method (redirecting the output to a new
file) so as to save coding time. Not to mention that I can't afford to lose
any data in that file.
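
For the archive, here is a minimal sketch of that approach, based on John's
snippet above. The sub name convert_file and the explicit input/output paths
are my own additions for illustration (the original simply read from FILE
and redirected STDOUT); the record length is passed in, so 1000 is only the
value used in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy $in_path to $out_path,
# replacing every control character with a space and writing a "\n"
# after each record of $reclen bytes. The input file is never modified.
sub convert_file {
    my ( $in_path, $out_path, $reclen ) = @_;

    open my $in,  '<:raw', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>:raw', $out_path or die "Cannot open $out_path: $!";

    # A reference to an integer makes <$in> return fixed-length records,
    # so only $reclen bytes are held in memory at a time.
    local $/ = \$reclen;

    while ( my $record = <$in> ) {
        $record =~ s/[[:cntrl:]]/ /g;    # [:cntrl:] already covers \t, ^M, ^@
        print {$out} $record, "\n" or die "Write to $out_path failed: $!";
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage: perl convert.pl big_input.dat big_output.dat
convert_file( $ARGV[0], $ARGV[1], 1000 ) if @ARGV == 2;
```

Note that the last record also gets a trailing "\n", even if it is shorter
than the record length.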


Thanks to all who replied to my queries, and thanks for the time spent.

Regds,
SK


