Thank you William, Bill and Tim. Finally s/[\x00-\x1f]//g did the trick, almost perfectly. The original file is a Palm database of memo pads. The text is there, in plain form. Several mixed control characters were present. The system I am working on is a Fedora Linux box. I have no hex utility installed to make the dump, so I don't know if the ^E is really a ^E.
Anyway it flew away after executing:

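# Read the whole file into one string.
open(my $in, "<", "046.txt") or die "Cannot open 046.txt: $!";
my @text = <$in>;
close($in);

my $text = join "", @text;

# Remove ASCII control characters (this range also includes tab and newline).
$text =~ s/[\x00-\x1f]//g;
print $text;

open(my $out, ">", "file2.txt") or die "Cannot open file2.txt: $!";
print $out $text;
close($out);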

file2.txt now reads almost perfectly (I added the -------------- for clarity):
-------------------------------------------
                             ÿÝ.Anodizado
Ultima actualización: 06-Mar-2004 http://www.kr2-egb.com.ar/anodizado.htm ¿Que es el anodizado? Cuando escuchamos este termino, lo primero que se nos cruza por la cabeza es el coloreado del aluminio, pues algo de eso tiene, pero en si el proceso de anodizado es una forma de proteger el aluminio contra de los agentes atmosféricos. Luego del extruído o decapado, este material entra en contacto con el aire y forma por si solo una delg.. Some more plain spanish text here ...Volver al inicio <index.htm> :ð, . <84> .
-------------------------------------------

The only problems left are the chars ÿÝ at the beginning and the :ð, . <84> at the end; there are no bad chars in between.
I added this:
$text=~s/ÿÝ//g;
$text=~s/Ý//g;
$text=~s/ÿ//g;
$text=~s/<84>//g;
But nothing happened. In particular in my vi editor the char <84> appears in blue, whereas the rest is black.

Any idea what to do with them?
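One possible explanation, offered as a sketch and not a confirmed diagnosis: if the file is in a single-byte encoding such as Latin-1 (an assumption), then ÿ is the byte 0xFF and Ý is 0xDD, and a script saved as UTF-8 would hold those literals as different byte sequences that never match. Likewise, vi usually shows an unprintable byte as <84>, so the pattern /<84>/ looks for four ordinary characters that are not in the file. Matching by hex escape avoids both problems. The sub name strip_high_junk is made up for this example.

```perl
use strict;
use warnings;

# Assumption: $s holds raw single-byte (Latin-1 style) data.
sub strip_high_junk {
    my ($s) = @_;
    $s =~ s/\xff\xdd//g;       # the byte pair displayed as "ÿÝ" in Latin-1
    $s =~ s/[\x80-\x9f]//g;    # bytes 0x80-0x9f, which include 0x84
    return $s;
}

# \xf3 is o-acute in Latin-1; it lies outside both patterns and is kept.
my $sample = "\xff\xdd.Anodizado actualizaci\xf3n \x84.";
print strip_high_junk($sample), "\n";
```

The accented Spanish letters sit at 0xC0 and above in Latin-1, so the second pattern leaves them alone.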

Thank you again. Very helpful so far.

Alejandro



On Oct 1, 2008, at 4:44 PM, [EMAIL PROTECTED] wrote:

hi anonymous --

the first thing to do is to be very clear about exactly what the characters are that you are trying to eliminate -- and those you are trying to keep!

you do not say what character set you are dealing with -- ascii, utf8, utf16,
etc., etc.   it would be nice to know this also.

it would also be nice to know the operating system and perl version you are
working with.

one way to find out about actual characters is to use a hex dump utility
of some kind. is what displays in my e-mail as ``^E'' (caret-E) really a
caret character followed by an upper-case E character, or is it a
control-E (ascii 0x05 ``ENQ'')? likewise, is ^@ (caret-@) a control-@
(ascii 0x00 ``NUL'') character?   what about all the whitespace that
surrounds these characters in my e-mail: is that really there?
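If no hex dump utility is at hand, perl itself can stand in for one. The following is a minimal sketch, not a replacement for a real tool; the sub name hex_dump and the output layout are choices made for this example. It shows how a real control-E (0x05) differs from a caret followed by an E (0x5e 0x45).

```perl
use strict;
use warnings;

# Format a string of raw bytes as hex-dump lines:
# offset, up to 16 bytes in hex, then a printable view.
sub hex_dump {
    my ($bytes) = @_;
    my @lines;
    for (my $offset = 0; $offset < length $bytes; $offset += 16) {
        my $chunk = substr $bytes, $offset, 16;
        my $hex   = join ' ', map { sprintf '%02x', ord } split //, $chunk;
        (my $text = $chunk) =~ s/[^\x20-\x7e]/./g;   # non-printable bytes become "."
        push @lines, sprintf '%08x  %-47s  |%s|', $offset, $hex, $text;
    }
    return @lines;
}

# Demo on an in-memory sample: a control-E, then a literal caret and "E", then NUL.
print "$_\n" for hex_dump("A\x05B^E\x00");
```

To inspect a file, read it in binary mode (for example with open my $fh, '<:raw', $path) and pass its contents to hex_dump.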

another important step is to familiarize yourself with regex format --
perlre, perlretut and perlrequick are important here.

one quick point is that the regex s/![a-zA-Z][0-9]//g does not negate the
character classes that follow the ``!'': the ``!'' character is not special
in a regex, it is literally a ``!'', an exclamation mark. you might want
something like s/[^a-zA-Z0-9]//g instead -- however, this will also
delete the accented characters you say you want to keep.
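The difference between the two patterns can be seen in a small sketch. The sub names and the sample string are invented for illustration, and the sample assumes Latin-1 bytes, where \xe9 is e-acute.

```perl
use strict;
use warnings;

# "!" is a literal here: this removes only the sequence
# "!", one ASCII letter, one digit.
sub strip_bang_pattern {
    my ($s) = @_;
    $s =~ s/![a-zA-Z][0-9]//g;
    return $s;
}

# "^" as the first character inside [...] negates the class:
# this removes everything that is not an ASCII letter or digit.
sub keep_ascii_alnum {
    my ($s) = @_;
    $s =~ s/[^a-zA-Z0-9]//g;
    return $s;
}

my $sample = "caf\xe9 !a1 x9";
print strip_bang_pattern($sample), "\n";   # only "!a1" is gone
print keep_ascii_alnum($sample), "\n";     # spaces, "!" and the accented byte are gone
```

The second sub drops the accented byte along with the punctuation, which is the side effect mentioned above.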

if you just want to eliminate ascii control characters, the regex
s/[\x00-\x1f]//g  would, i think, do the trick.   try something like

perl  -i.bak  -lpe "s/[\x00-\x1f]//g"  input.file

on a COPY (and in a separate directory) of the file you are trying to
fix.   (i am assuming you are running windows.)
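One thing to keep in mind: the range \x00-\x1f also contains tab (0x09), line feed (0x0a) and carriage return (0x0d). With -l and -p the line ending is chomped off and put back, so line breaks survive the one-liner, but applying the same regex to a whole file held in one string joins every line together. A hedged sketch of a variant that leaves those three characters alone follows; both sub names are made up for this example.

```perl
use strict;
use warnings;

# Remove every ASCII control character, including tab, LF and CR.
sub strip_controls {
    my ($s) = @_;
    $s =~ s/[\x00-\x1f]//g;
    return $s;
}

# Remove control characters but keep tab (0x09), LF (0x0a) and CR (0x0d).
sub strip_controls_keep_layout {
    my ($s) = @_;
    $s =~ s/[\x00-\x08\x0b\x0c\x0e-\x1f]//g;
    return $s;
}

my $sample = "a\x05b\tc\nd";
print strip_controls($sample), "\n";              # everything on one line
print strip_controls_keep_layout($sample), "\n";  # tab and newline preserved
```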

hth -- bill walters




_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs