Thank you William, Bill and Tim. In the end s/[\x00-\x1f]//g did the
trick, almost perfectly.
The original file is the Palm memo pad database. The text is there in
plain form, mixed with several control characters.
The system I am working on is a Fedora Linux box. I have no hex utility
installed to make the dump, so I don't know if the ^E is really a ^E.
Anyway it flew away after executing:
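# Read the whole memo file into one string.
open(FILE, "<", "046.txt") or die "Cannot open 046.txt: $!";
@text = <FILE>;
close(FILE);
$text = join "", @text;
# Delete ASCII control characters (0x00-0x1f); this range includes newlines and tabs.
$text =~ s/[\x00-\x1f]//g;
print $text;
# Save the cleaned text.
open(FILE, ">", "file2.txt") or die "Cannot open file2.txt: $!";
print FILE $text;
close(FILE);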
file2.txt now reads almost perfectly (I added the dashed lines for
clarity):
-------------------------------------------
ÿÝ.Anodizado
Ultima actualización: 06-Mar-2004 http://www.kr2-egb.com.ar/anodizado.htm
¿Que es el anodizado?
Cuando escuchamos este termino, lo primero que se nos cruza por la
cabeza es el coloreado del aluminio, pues algo de eso tiene, pero en
si el proceso de anodizado es una forma de proteger el aluminio contra
de los agentes atmosféricos. Luego del extruído o decapado, este
material entra en contacto con el aire y forma por si solo una delg..
Some more plain spanish text here ...Volver al inicio
<index.htm> :ð, . <84> .
-------------------------------------------
Everything is fine except for the chars ÿÝ at the beginning and
the :ð, . <84> at the end; there are no bad chars in between.
I added this:
$text=~s/ÿÝ//g;
$text=~s/Ý//g;
$text=~s/ÿ//g;
$text=~s/<84>//g;
But nothing happened. In particular, in my vi editor the char <84>
appears in blue, whereas the rest is black.
Any idea what to do with them?
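A possible explanation, offered as a sketch under assumptions: if the file is single-byte Latin-1 text and vi's blue <84> stands for the one byte 0x84 rather than the four characters '<', '8', '4', '>', then s/<84>//g can never match. Likewise, a literal ÿ or Ý typed into a script saved as UTF-8 would not match the single bytes 0xFF and 0xDD. Writing the bytes as hex escapes sidesteps both problems. The sample string below is made up for illustration, and dropping 0xF0 (ð) everywhere is an assumption that suits Spanish text only.

```perl
use strict;
use warnings;

# Hypothetical sample: bytes 0xFF 0xDD at the start, 0xF0 and 0x84 near the end,
# with Latin-1 accented bytes (0xBF = inverted ?, 0xF3 = o-acute) in between.
my $text = "\xff\xdd.Anodizado \xbfQue es? actualizaci\xf3n :\xf0, . \x84 .";

# Each \xNN escape matches exactly one byte of this byte string.
$text =~ s/^\xff\xdd//;      # the two leading bytes displayed as ÿÝ
$text =~ s/[\x7f-\x9f]//g;   # DEL plus the 0x80-0x9f control range, which includes 0x84
$text =~ s/\xf0//g;          # the byte displayed as ð (assumed unwanted here)

# Show what is left as hex so that no terminal encoding gets in the way.
print join(' ', map { sprintf '%02x', ord } split //, $text), "\n";
```

The accented bytes 0xBF and 0xF3 survive because they lie outside every range that is removed.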
Thank you again. Very helpful so far.
Alejandro
On Oct 1, 2008, at 4:44 PM, [EMAIL PROTECTED] wrote:
hi anonymous --
the first thing to do is to be very clear about exactly what the
characters are that you are trying to eliminate -- and those you are
trying to keep!
you do not say what character set you are dealing with -- ascii, utf8,
utf16, etc. it would be nice to know this also.
it would also be nice to know the operating system and perl version you
are working with.
one way to find out about actual characters is to use a hex dump utility
of some kind. is what displays in my e-mail as ``^E'' (caret-E) really a
caret character followed by an upper-case E character, or is it a
control-E (ascii 0x05 ``ENQ'')? likewise, is ^@ (caret-@) a control-@
(ascii 0x00 ``NUL'') character? what about all the whitespace that
surrounds these characters in my e-mail: is that really there?
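When no hex dump utility is at hand, Perl itself can show the bytes. The following is a minimal sketch, not a replacement for a real tool. The helper name hex_dump is hypothetical, and the sample string stands in for the real file contents.

```perl
use strict;
use warnings;

# hex_dump is a hypothetical helper: it returns one line per 16 bytes with
# the offset, the bytes in hex, and printable ASCII (anything else as '.').
sub hex_dump {
    my ($data) = @_;
    my @lines;
    for (my $off = 0; $off < length $data; $off += 16) {
        my $chunk = substr $data, $off, 16;
        my $hex   = join ' ', map { sprintf '%02x', ord } split //, $chunk;
        (my $ascii = $chunk) =~ s/[^\x20-\x7e]/./g;
        push @lines, sprintf '%08x  %-47s  %s', $off, $hex, $ascii;
    }
    return @lines;
}

# Sample: a real control-E (0x05) next to a literal caret followed by "E",
# then a NUL byte.
print "$_\n" for hex_dump("a\x05b^E\x00");
```

In the output, a real control-E appears as 05, while a caret followed by E appears as 5e 45. To inspect an actual file, read it in binary mode first (for example with open and the '<:raw' layer) and pass its contents to the helper.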
another important step is to familiarize yourself with regex format --
perlre, perlretut and perlrequick are important here.
one quick point is that the regex expression s/![a-zA-Z][0-9]//g does
not negate the character classes that follow it: the ``!'' character is
not special in a regex, it is literally a ``!'', an exclamation mark.
you might want something like s/[^a-zA-Z0-9]//g instead -- however,
this will also delete the accented characters you say you want to keep.
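The difference between the two expressions can be seen on a small made-up sample. This is only an illustration; it assumes Latin-1 bytes, with 0xF3 standing for an accented o and 0x05 for control-E.

```perl
use strict;
use warnings;

# Hypothetical sample: a literal "!a1", a Spanish word with an accented
# byte (0xF3), and a trailing control-E (0x05).
my $sample = "!a1 actualizaci\xf3n\x05";

# "!" is an ordinary character here: this deletes only the text "!a1".
(my $bang = $sample) =~ s/![a-zA-Z][0-9]//g;

# A leading "^" inside the brackets negates the class: this deletes every
# byte that is not an ASCII letter or digit, the accented byte included.
(my $negated = $sample) =~ s/[^a-zA-Z0-9]//g;

print "after the '!' version: ", length($bang), " bytes left\n";
print "after the negated class: $negated\n";
```

The first substitution leaves the control character and the accent untouched. The second removes the control character, but the accent and the space go with it.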
if you just want to eliminate ascii control characters, the regex
s/[\x00-\x1f]//g would, i think, do the trick. try something like
perl -i.bak -lpe "s/[\x00-\x1f]//g" input.file
on a COPY (and in a separate directory) of the file you are trying to
fix. (i am assuming you are running windows.)
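One detail worth noting about the range 0x00-0x1f: it includes tab (0x09), line feed (0x0a) and carriage return (0x0d). In the one-liner above, -l removes the line ending before the substitution runs and puts it back on output, so line breaks survive. If the same substitution is applied to a whole file held in one string, the line breaks are deleted as well. The sketch below uses a made-up sample and shows a narrower class that keeps those three characters.

```perl
use strict;
use warnings;

# Hypothetical sample with a control-E, a tab, a CRLF pair, a NUL and a final LF.
my $text = "uno\x05\tdos\r\ntres\x00\n";

# The full range 0x00-0x1f: tab, CR and LF are deleted too.
(my $all = $text) =~ s/[\x00-\x1f]//g;

# The same range minus tab (0x09), LF (0x0a) and CR (0x0d).
(my $kept = $text) =~ s/[\x00-\x08\x0b\x0c\x0e-\x1f]//g;

print "full range:   [$all]\n";
print "narrow range: [$kept]\n";
```

Which of the two is wanted depends on whether the line structure of the memo should be preserved.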
hth -- bill walters
_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs