On 2015-08-19 23:43, Adam Borowski wrote: > Package: lintian > Version: 2.5.36.1 > Severity: wishlist > Tags: patch > >
Hi, Thanks for the suggestion and the prototype patch. I think it is an interesting proposal and I think we could try it as an experimental check to see if it is a feasible check. I have some minor technical remarks to the patch interleaved to below. It will also need a test case in the t/tests suite. Axel (XTaran) and I are at DebConf if you need help with anything, etc. :) > [...] > > > 0001-New-experimental-tag-text-file-uses-obsolete-encodin.patch > > >>From 902283f122c71c88b968abfc3c778686200c9361 Mon Sep 17 00:00:00 2001 > From: Adam Borowski <[email protected]> > Date: Wed, 19 Aug 2015 23:32:39 +0200 > Subject: [PATCH] New experimental tag: text-file-uses-obsolete-encoding > > Signed-off-by: Adam Borowski <[email protected]> > --- > checks/files.desc | 11 +++++++++++ > checks/files.pm | 10 +++++++++- > lib/Lintian/Util.pm | 22 ++++++++++++++++++++++ > 3 files changed, 42 insertions(+), 1 deletion(-) > > [...] > diff --git a/checks/files.pm b/checks/files.pm > index b816ed8..0940bbf 100644 > --- a/checks/files.pm > +++ b/checks/files.pm > @@ -27,7 +27,7 @@ use Lintian::Data; > use Lintian::Output qw(warning); > use Lintian::Tags qw(tag); > use Lintian::Util qw(drain_pipe fail is_string_utf8_encoded open_gz > - signal_number2name strip normalize_pkg_path); > + signal_number2name strip normalize_pkg_path file_is_non_utf8_text); > use Lintian::SlidingWindow; > > use constant BLOCKSIZE => 16_384; > @@ -1514,6 +1514,14 @@ sub run { > if $info->index($1); > } > > + # ---------------- encoding > + if ( $fname =~ m{^(?:usr/)?s?bin/} Is this already guarded by a "$file->is_regular_file" (or $file->is_file)? Otherwise, this will break horribly with lintian trying to open directories, named pipes, etc. > + or $fname =~ m{\.(?:pm|py|pl|txt)$} Minor nit: Perhaps \.(?:p[myl]|txt(?:\.gz)?)$ * Would also test gzip-compressed text files * p[myl] is slightly more optimial than pm|py|pl > [...] > index 0b8fa5a..f68ea71 100644 > --- a/lib/Lintian/Util.pm > +++ b/lib/Lintian/Util.pm > @@ -62,6 +62,7 @@ BEGIN { > slurp_entire_file > file_is_encoded_in_non_utf8 > is_string_utf8_encoded > + file_is_non_utf8_text > fail > strip > lstrip > @@ -859,6 +860,27 @@ sub file_is_encoded_in_non_utf8 { > return $line; > } > > +=item file_is_non_utf8_text (...) > + > +Both binary files and text files encoded in proper UTF8 give a negative > +answer. > + > +=cut > + > +sub file_is_non_utf8_text { > + my ($file) = @_; > + > + my $fd = ($file->file_info =~ m/gzip compressed/) ? $file->open_gz : > $file->open; > + my $bad=0; > + while (<$fd>) { I would like to see this use a named variable here (to keep the uses of "$_" down). > + return close($fd), 0 if (!m/^[\t\n\f\r -\x{ff}]+$/); The $fd might be a pipe to a subprocess (in the open_gz case). It would need a "drain_pipe" call to avoid spurious errors caused by gzip dying by "broken pipe". > [...] >

