On Wed, Jan 7, 2009 at 07:41, Anže Vidmar <anz...@gmail.com> wrote:
> hello!
>
> I have a nasty non-ASCII character in some files that contain PHP code
> (somewhere in my SVN branch). I want to recursively find all the files
> that contain a specific non-ASCII character. Most importantly, I need to
> know the names of the files containing it.
>
> So far, I have found a script that looks into a file for non-ASCII
> characters and prints these characters in hex:
>
> while (<>) {
>    s/([\x80-\xff])/sprintf "\\x{%02x}",ord($1)/eg;
>    print;
> }
>
> OK, this is good. The non-ASCII sequence (in hex) that I'm looking for is:
>
> x{ef}\\x{bb}\\x{bf}
>
> The problem is that I can't get this script to run recursively, and I
> don't get the name of the file that actually contains these characters.
>
> I've tried with bash, but since the script writes to standard output, I
> can't get any result from this. Here is what I've tried:
>
> find |xargs /usr/local/bin/check_for_non-ascii_characters.sh  |grep -l
> 'x{ef}\\x{bb}\\x{bf}'
>
> So, I need a way to recursively find non-ASCII characters (the specific
> pattern mentioned above) in all files, and I need the names of the files
> containing it.
>
> It would be enough if I could just see which files contain this
> character sequence.
>
> Thanks

#!/usr/bin/perl

use strict;
use warnings;

use File::Find;

File::Find::find(
    sub {
        return unless -f;
        # to check only PHP files, add: return unless /\.php$/;
        open my $fh, "<", $_
            or die "could not open $File::Find::name: $!";
        while (<$fh>) {
            my $offset = 0;
            for my $char (split //) {
                if (ord $char > 127) {
                    printf "non-ascii char (%04x) in file %s on line %d position %d:\n%s\n",
                        ord($char), $File::Find::name, $., $offset, $_;
                }
                $offset++;
            }
        }
    },
    @ARGV
);
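
If only the names of files containing the specific byte sequence from
the question (EF BB BF) are needed, the script above can be narrowed
down. The following is a minimal sketch, not a tested drop-in: the
helper name files_containing is made up for illustration, and it assumes
the files are small enough to read into memory whole.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Find;

# Byte sequence to look for: EF BB BF, as quoted in the question.
my $needle = "\xef\xbb\xbf";

# Return the paths of all plain files under @dirs whose raw bytes
# contain $bytes. (files_containing is a hypothetical helper name.)
sub files_containing {
    my ($bytes, @dirs) = @_;
    my @hits;
    find(
        sub {
            return unless -f;
            # read raw bytes so no encoding layer alters the data
            open my $fh, "<:raw", $_
                or do { warn "could not open $File::Find::name: $!\n"; return };
            local $/;    # slurp the whole file in one read
            my $content = <$fh>;
            push @hits, $File::Find::name
                if defined $content && index($content, $bytes) >= 0;
        },
        @dirs
    );
    return @hits;
}

# Print one matching file name per line for the directories given
# on the command line.
print "$_\n" for files_containing($needle, @ARGV);
```

Run it the same way as the script above, passing the directories to
search as arguments. Copies kept in version-control metadata
directories such as .svn may show up in the output as well; a
return unless check on $File::Find::name could filter those out.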

-- 
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.
