Package: moreutils
Version: 0.63-1+b1
Severity: wishlist
Tags: patch upstream

I'm hereby proposing the inclusion of the attached "stocat" utility in
moreutils. It's like cat, but outputs each line with a given probability,
defaulting to 10%.

It's very useful for random sampling (and *much* more efficient at that
than "shuf", which is unwieldy on very large inputs). While it can be
implemented with awk/perl one-liners instead, those one-liners aren't very
mnemonic and are error-prone.

If desired, it could be extended with a reservoir sampling option, to
guarantee a selection of exactly K items.

Thanks a lot for moreutils!
Cheers

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.6.0-1-amd64 (SMP w/8 CPU cores)
Kernel taint flags: TAINT_WARN
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages moreutils depends on:
ii  libc6                  2.30-8
ii  libipc-run-perl        20200505.0-1
ii  libtime-duration-perl  1.21-1
ii  libtimedate-perl       2.3200-1
ii  perl                   5.30.2-1

moreutils recommends no packages.

moreutils suggests no packages.

-- no debconf information

#!/usr/bin/perl

=head1 NAME

stocat - stochastic cat, selecting lines with uniform probability

=head1 SYNOPSIS

=over

=item B<stocat> [B<-p>|B<--probability> PROBABILITY] [I<FILE>|B<->]...

=back

=head1 DESCRIPTION

Concatenate FILE(s) to standard output, but print each input line only
with a given probability, defaulting to 0.1 (i.e., 10%).

With no FILE or when FILE is B<->, read standard input.

=head1 OPTIONS

=over 4

=item -p, --probability

Output lines with the given probability, specified as a number between 0
(0% probability) and 1 (100% probability). Default: 0.1 (i.e., 10%
probability).

=back

=head1 SEE ALSO

L<cat(1)>

=head1 AUTHOR

Copyright 2020 by Stefano Zacchiroli <z...@upsilon.cc>

Licensed under the GNU GPL.

=cut

use strict;
use warnings;
use Getopt::Long;

sub die_usage {
    die "Usage: $0 [--probability|-p PROBABILITY] [file|-]...\n";
}

my $probability = 0.1;
if (! GetOptions("probability|p=f" => \$probability)) {
    die_usage();
}
if ($probability < 0 || $probability > 1) {
    die_usage();
}

# rand() returns a value in [0, 1), so the strict comparison makes
# a probability of 0 print nothing and a probability of 1 print everything.
while (<>) {
    print $_ if rand() < $probability;
}
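
The reservoir sampling extension mentioned in the report could look roughly like the sketch below. It is not part of the attached stocat; the helper name reservoir_sample and its (K, filehandle) interface are assumptions made for illustration. It uses the classic "Algorithm R" approach: keep the first K lines, then replace a random slot with probability K/N when reading the N-th line, so that exactly K lines come out (or all of them, if the input has fewer than K).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the proposed stocat: return exactly
# min($k, number of input lines) lines from the filehandle $fh, each
# input line being equally likely to be selected.
sub reservoir_sample {
    my ($k, $fh) = @_;
    my @reservoir;
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            # Fill the reservoir with the first $k lines.
            push @reservoir, $line;
        } else {
            # Pick a slot in [0, $seen); the line replaces an existing
            # entry only if the slot falls inside the reservoir, which
            # happens with probability $k / $seen.
            my $j = int(rand($seen));
            $reservoir[$j] = $line if $j < $k;
        }
    }
    return @reservoir;
}

# Demo on an in-memory input of 1000 lines: prints 5 randomly chosen lines.
my $input = join '', map { "line $_\n" } 1 .. 1000;
open my $fh, '<', \$input or die "cannot open in-memory file: $!";
print reservoir_sample(5, $fh);
close $fh;
```

Unlike the per-line probability test in stocat, this needs to hold K lines in memory and emits nothing until the input ends, and the output order is not the input order. That trade-off is probably why it would fit better as a separate option than as the default behaviour.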