Bug#961638: proposal: stocat - probabilistic cat
Dear Joey,

Stefano suggested the inclusion of his 'stocat' tool (see below) into
moreutils. I think the simple script is a great idea and I would like
to see it in the moreutils collection. What do you think? Would you
consider adopting it?

Kind regards,
Nicolas

On Wed, Mar 10, 2021 at 02:38:55PM +0100, Stefano Zacchiroli wrote:
> On Tue, May 26, 2020 at 11:39:42PM +0200, Stefano Zacchiroli wrote:
> > I'm hereby proposing the inclusion of the attached "stocat" utility to
> > moreutils. It's like cat, but output lines with a given probability,
> > defaulting to 10%. It's very useful for random sampling (and *much*
> > more efficient at that than using "shuf" which is unwieldy on very
> > large inputs) and, while it can be implemented instead with awk/perl
> > oneliners, those oneliners aren't very mnemonic and are error prone.
>
> Heya, as I haven't heard back about this, but others have asked me about
> how to best use stocat, I've now released it as an independent tool
> here:
>
> https://gitlab.com/zacchiro/stocat
>
> I'm happy to reconsider if/when it gets integrated into moreutils.
>
> Cheers
> --
> Stefano Zacchiroli . z...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
> Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
> Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
> « the first rule of tautology club is the first rule of tautology club »

-- 
email: nico...@fjasle.eu       irc://oftc.net/nsc
↳ gpg: 18ed 52db e34f 860e e9fb c82b 7d97 0932 55a0 ce7f
     -- the fear of the Lord is the beginning of knowledge --
Bug#961638: proposal: stocat - probabilistic cat
On Tue, May 26, 2020 at 11:39:42PM +0200, Stefano Zacchiroli wrote:
> I'm hereby proposing the inclusion of the attached "stocat" utility to
> moreutils. It's like cat, but output lines with a given probability,
> defaulting to 10%. It's very useful for random sampling (and *much*
> more efficient at that than using "shuf" which is unwieldy on very
> large inputs) and, while it can be implemented instead with awk/perl
> oneliners, those oneliners aren't very mnemonic and are error prone.

Heya, as I haven't heard back about this, but others have asked me about
how to best use stocat, I've now released it as an independent tool
here:

https://gitlab.com/zacchiro/stocat

I'm happy to reconsider if/when it gets integrated into moreutils.

Cheers
-- 
Stefano Zacchiroli . z...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »
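
For readers unfamiliar with the one-liners mentioned in the quoted proposal: a plausible Perl form (an assumption, the thread does not spell one out) is `perl -ne 'print if rand() < 0.1' FILE`. The sketch below wraps the same per-line test in a hypothetical helper, `bernoulli_sample`, to show that each line is kept independently, so the number of lines kept varies from run to run and is only about p times the input size on average.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of stocat: keep each element
# independently with probability $p (0 <= $p <= 1).
sub bernoulli_sample {
    my ($p, @lines) = @_;
    # rand() returns a value in [0, 1), so $p = 0 keeps nothing
    # and $p = 1 keeps everything.
    return grep { rand() < $p } @lines;
}

# Example: sample roughly 10% of 1000 in-memory lines.
my @lines = map { "line $_\n" } 1 .. 1000;
my @kept  = bernoulli_sample(0.1, @lines);
printf "kept %d of %d lines\n", scalar(@kept), scalar(@lines);
```

The count printed will differ between runs; a fixed sample size needs a different technique, such as reservoir sampling.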
Bug#961638: proposal: stocat - probabilistic cat
Package: moreutils
Version: 0.63-1+b1
Severity: wishlist
Tags: patch upstream

I'm hereby proposing the inclusion of the attached "stocat" utility to
moreutils. It's like cat, but output lines with a given probability,
defaulting to 10%. It's very useful for random sampling (and *much*
more efficient at that than using "shuf" which is unwieldy on very
large inputs) and, while it can be implemented instead with awk/perl
oneliners, those oneliners aren't very mnemonic and are error prone.

If desired, it could be extended by adding a reservoir sampling option,
to guarantee a selection of exactly K items.

Thanks a lot for moreutils!
Cheers

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.6.0-1-amd64 (SMP w/8 CPU cores)
Kernel taint flags: TAINT_WARN
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages moreutils depends on:
ii  libc6                  2.30-8
ii  libipc-run-perl        20200505.0-1
ii  libtime-duration-perl  1.21-1
ii  libtimedate-perl       2.3200-1
ii  perl                   5.30.2-1

moreutils recommends no packages.

moreutils suggests no packages.

-- no debconf information

#!/usr/bin/perl

=head1 NAME

stocat - stochastic cat, selecting lines with uniform probability

=head1 SYNOPSIS

=over

=item B<stocat> [B<-p>|B<--probability> PROBABILITY] [I<FILE>|B<->]...

=back

=head1 DESCRIPTION

Concatenate FILE(s) to standard output, but printing each input line to
output only with a given probability, defaulting to 0.1 (i.e., 10%).

With no FILE or when FILE is B<->, read standard input.

=head1 OPTIONS

=over 4

=item -p, --probability

Output lines with the given probability, specified as a number between 0
(0% probability) and 1 (100% probability). Default: 0.1 (i.e., 10%
probability).

=back

=head1 SEE ALSO

L

=head1 AUTHOR

Copyright 2020 by Stefano Zacchiroli

Licensed under the GNU GPL.

=cut

use strict;
use warnings;
use Getopt::Long;

sub die_usage() {
    die "Usage: $0 [--probability|-p PROBABILITY] [file|-]\n";
}

my $probability = 0.1;
if (! GetOptions("probability|p=f" => \$probability)) {
    die_usage();
}
if ($probability < 0 || $probability > 1) {
    die_usage();
}

# rand() returns a value in [0, 1), so the strict comparison prints
# nothing when the probability is 0 and every line when it is 1.
while (<>) {
    print $_ if rand() < $probability;
}
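
The report mentions that a reservoir sampling option could guarantee a selection of exactly K items. As a rough illustration of what such an option might do, here is a minimal sketch of the classic single-pass reservoir technique (often called Algorithm R). The helper name `reservoir_sample` and its interface are hypothetical; nothing like it exists in the attached script, and this is not necessarily how the author would implement it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of stocat: return a reference to an
# array holding a uniform random sample of exactly $k lines read from
# $fh (or all lines, if fewer than $k are available).
sub reservoir_sample {
    my ($fh, $k) = @_;
    my @reservoir;
    return \@reservoir if $k <= 0;
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            # Fill the reservoir with the first $k lines.
            push @reservoir, $line;
        } else {
            # Keep the new line with probability $k / $seen by drawing
            # a slot in 0 .. $seen-1 and replacing only if it falls
            # inside the reservoir.
            my $j = int(rand($seen));
            $reservoir[$j] = $line if $j < $k;
        }
    }
    return \@reservoir;
}

# Example: sample 3 lines from an in-memory file handle.
my $input = join '', map { "line $_\n" } 1 .. 100;
open(my $fh, '<', \$input) or die "cannot open in-memory handle: $!";
my $sample = reservoir_sample($fh, 3);
close($fh);
print @$sample;
```

Unlike the per-line probability test, this keeps only K lines in memory and always returns exactly K of them when the input is long enough, but the sampled lines are not necessarily emitted in input order.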