Package: moreutils
Version: 0.63-1+b1
Severity: wishlist
Tags: patch upstream

I'm hereby proposing the inclusion of the attached "stocat" utility in
moreutils. It's like cat, but outputs each line with a given probability,
defaulting to 10%. It's very useful for random sampling (and *much* more
efficient at that than "shuf", which is unwieldy on very large inputs). While it
can instead be implemented with awk/perl one-liners, those one-liners aren't
very mnemonic and are error-prone.

If desired, it could be extended by adding a reservoir sampling option, to
guarantee a selection of exactly K items.
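
A minimal sketch of what such a reservoir sampling mode could look like,
assuming the classic single-pass approach (often called Algorithm R). The
reservoir_sample helper and the in-memory demo input are hypothetical and
not part of the attached script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the attached stocat): return a uniform random
# sample of exactly $k lines from filehandle $fh, or every line if the input
# has fewer than $k lines. Uses O($k) memory and a single pass over the input.
sub reservoir_sample {
    my ($k, $fh) = @_;
    my @reservoir;
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            # Fill the reservoir with the first $k lines.
            push @reservoir, $line;
        } else {
            # Pick a slot uniformly in 0 .. $seen-1; the new line replaces
            # an existing entry only if the slot falls inside the reservoir.
            my $slot = int(rand($seen));
            $reservoir[$slot] = $line if $slot < $k;
        }
    }
    return @reservoir;
}

# Demo on in-memory input (assumed data): prints 3 of the 10 lines.
my $demo_input = join '', map { "line $_\n" } 1 .. 10;
open my $demo_fh, '<', \$demo_input or die "cannot open in-memory input: $!";
print reservoir_sample(3, $demo_fh);
close $demo_fh;
```

Unlike the per-line coin flip in stocat, this always yields exactly K lines
when the input has at least K, at the cost of holding K lines in memory and
not preserving input order.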

Thanks a lot for moreutils!

Cheers

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.6.0-1-amd64 (SMP w/8 CPU cores)
Kernel taint flags: TAINT_WARN
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), 
LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages moreutils depends on:
ii  libc6                  2.30-8
ii  libipc-run-perl        20200505.0-1
ii  libtime-duration-perl  1.21-1
ii  libtimedate-perl       2.3200-1
ii  perl                   5.30.2-1

moreutils recommends no packages.

moreutils suggests no packages.

-- no debconf information
#!/usr/bin/perl

=head1 NAME

stocat - stochastic cat, selecting lines with uniform probability


=head1 SYNOPSIS

=over

=item B<stocat> [B<-p>|B<--probability> PROBABILITY] [I<FILE>|B<->]...

=back


=head1 DESCRIPTION

Concatenate FILE(s) to standard output, printing each input line only with a
given probability, defaulting to 0.1 (i.e., 10%). Each line is selected
independently, so the number of output lines varies from run to run.

With no FILE or when FILE is B<->, read standard input.


=head1 OPTIONS

=over 4

=item -p, --probability

Output lines with the given probability, specified as a number between 0 (0%
probability) and 1 (100% probability). Default: 0.1 (i.e., 10% probability).

=back


=head1 SEE ALSO

L<cat(1)>


=head1 AUTHOR

Copyright 2020 by Stefano Zacchiroli <z...@upsilon.cc>

Licensed under the GNU GPL.

=cut

use strict;
use warnings;
use Getopt::Long;

# Print a usage message matching the SYNOPSIS and exit with an error.
sub die_usage() {
    die "Usage: $0 [--probability|-p PROBABILITY] [FILE|-]...\n";
}

my $probability = 0.1;
if (! GetOptions("probability|p=f" => \$probability)) {
    die_usage();
}
if ($probability < 0 || $probability > 1) {
    die_usage();
}

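# rand() returns a value in [0, 1), so a strict comparison guarantees that
# probability 0 never prints a line and probability 1 always does.
while (<>) {
    print $_ if rand() < $probability;
}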
