Re: [PATCH]: uniq: add "--group" option

Pádraig Brady Thu, 21 Feb 2013 08:11:41 -0800

On 02/21/2013 03:42 PM, Assaf Gordon wrote:

Hello Pádraig,


Pádraig Brady wrote, On 02/20/2013 08:47 PM:

On 02/20/2013 06:44 PM, Assaf Gordon wrote:

Hello,

Attached is a suggestion for "--group" option in uniq, as discussed here:
    http://lists.gnu.org/archive/html/coreutils/2011-03/msg00000.html
    http://lists.gnu.org/archive/html/coreutils/2012-03/msg00052.html

The patch adds two parameters:
        --group=[method]  separate each unique line (whether duplicated or not)
                          with a marker.
                          method={none,separate(default),prepend,append,both}
        --group-separator=SEP   with --group, separates group using SEP
                          (default: empty line)


--group-sep is probably overkill.
I'd just use \n or \0 if -z specified.

OK.

As for separation methods I'd just go with what we have for
--all-repeated (but remove 'none' which wouldn't be useful with --group),
as we've never had requests for anything else. So:
--group={prepend, separate(default)}


I'd like to have at least "append" or "both", for the added convenience of 
downstream analysis.
It's obviously a "nice-to-have" and not "must-have" feature, and can be 
implemented in other ways, but knowing that there will always be a terminating marker *after* a 
group (even the last group) makes downstream processing code simpler.

Typical example:
  $ cat INPUT | uniq --group=append | \
       awk '$0!="" { ## item in the group, collect it }
            $0=="" { ## end of group, do something }'

Without the final group marker, any downstream code will require two points of 
"group processing": when a marker is found, and at EOF.
Something like:

  $ cat INPUT | uniq --group=separate | \
       awk '$0!="" { ## item in the group, collect it }
            $0=="" { ## end of group, do something }
            END { ## end of last group, do something, duplicated code }'
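To make the comparison concrete, here is a minimal sketch of both downstream styles that reports the size of each group. The helpers `group_append` and `group_separate` are hypothetical awk stand-ins for the proposed `--group=append` and `--group=separate` behaviour (blank-line marker, input assumed to contain no empty lines), so the sketch does not depend on a patched uniq; they are not part of the patch itself.

```shell
#!/bin/sh
# Hypothetical stand-ins for the proposed options, written in awk so the
# sketch runs with a stock toolchain. Lines are compared as strings.

# Emulates --group=append: a blank line after every group, last included.
group_append() {
  awk 'NR > 1 && ($0 "") != (prev "") { print "" }
       { print; prev = $0 }
       END { if (NR > 0) print "" }'
}

# Emulates --group=separate: a blank line only between groups.
group_separate() {
  awk 'NR > 1 && ($0 "") != (prev "") { print "" }
       { print; prev = $0 }'
}

# With a trailing marker, a single rule reports every group.
count_with_append() {
  group_append | awk '$0 != "" { item = $0; n++ }
                      $0 == "" { print n, item; n = 0 }'
}

# Without it, the report has to be repeated in END for the last group.
count_with_separate() {
  group_separate | awk '$0 != "" { item = $0; n++ }
                        $0 == "" { print n, item; n = 0 }
                        END { if (n > 0) print n, item }'
}
```

Both `count_with_append` and `count_with_separate` read lines on stdin and print one "count item" line per group; the only difference is the extra `END` rule the second one needs.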

Similar reason for having "both", as it ensures I can put any special 
initialization code in the group-marker case, and don't need to duplicate it in a 
separate 'BEGIN{}' clause (of course, this doesn't have to be awk; it can be 
perl/python/ruby/whatever that will do the downstream processing).

I realize it's not a "make-or-break" feature - but if we're trying to make text 
processing easier, I believe "append/both" makes it even easier.


OK good arguments. Thanks.
Let's keep all apart from 'none' so.

So on to operation...

And it behaves "as expected":
===
$ printf "a\na\na\nb\nc\nc\n" | ./src/uniq --group-sep="--" --group=separate


The above isn't that useful and could be done with sed.

I assume you're specifically referring to the "group-sep" part - then OK.


Actually I was referring to the fact that in your example
--group didn't output all entries by default.
If it only output unique entries then you can separate with:

uniq | sed 'G'     # (note sed also supports -z)
uniq | sed '$q;G'

So `uniq --group` should output all items by default I think.
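As a quick illustration of the sed point, using the sample input from the example above: when only one entry per group is printed, stock `uniq` and `sed` already cover both separator styles. The function names below are illustrative only.

```shell
#!/bin/sh
# Sample input from the thread: three groups (a x3, b x1, c x2).
sample() { printf 'a\na\na\nb\nc\nc\n'; }

# Blank line after every unique entry, including the last one.
unique_then_blank() { uniq | sed 'G'; }

# Blank line between unique entries only: '$q' quits on the last
# line before G can append the extra newline.
unique_blank_between() { uniq | sed '$q;G'; }
```

`sample | unique_then_blank` yields a, blank, b, blank, c, blank; `sample | unique_blank_between` yields the same without the final blank line. Neither keeps the repeated entries, which is the case sed cannot cover and the reason for having `--group` print every line.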

Supporting -u or -d with --group wouldn't be useful either really.
It's probably most consistent to just disallow those combinations.


Just to be clear on the reasoning: because with "-u" and "-d", each *line* is 
implicitly a separate group, there's no apparent utility for an end-of-group marker.


Right.

I guess it's true from a technical POV - but again, for downstream analysis 
convenience it's nice to have a fixed end-of-group marker.
I could use the same downstream script (which expects end-of-group markers) with uniq, whether I 
used "-d" or "-u" or nothing at all.


But what's the point in such processing if there is only ever going
to be a single line in each group?

thanks,
Pádraig.
