On 02/21/2013 03:42 PM, Assaf Gordon wrote:
Hello Pádraig,
Pádraig Brady wrote, On 02/20/2013 08:47 PM:
On 02/20/2013 06:44 PM, Assaf Gordon wrote:
Hello,
Attached is a suggestion for "--group" option in uniq, as discussed here:
http://lists.gnu.org/archive/html/coreutils/2011-03/msg00000.html
http://lists.gnu.org/archive/html/coreutils/2012-03/msg00052.html
The patch adds two parameters:
--group=[method] separate each unique line (whether duplicated or not)
with a marker.
method={none,separate(default),prepend,append,both}
--group-separator=SEP with --group, separates group using SEP
(default: empty line)
--group-sep is probably overkill.
I'd just use \n or \0 if -z specified.
OK.
As for separation methods I'd just go with what we have for
--all-repeated (but remove 'none' which wouldn't be useful with --group),
as we've never had requests for anything else. So:
--group={prepend, separate(default)}
I'd like to have at least "append" or "both", for the added convenience of
downstream analysis.
It's obviously a "nice-to-have" and not "must-have" feature, and can be
implemented in other ways, but knowing that there will always be a terminating marker *after* a
group (even the last group) makes downstream processing code simpler.
Typical example:
$ cat INPUT | uniq --group=append | \
    awk '$0!="" { ## item in the group, collect it
         }
         $0=="" { ## end of group, do something
         }'
Without the final group marker, any downstream code will require two points of
"group processing": when a marker is found, and at EOF.
Something like:
$ cat INPUT | uniq --group=separate | \
    awk '$0!="" { ## item in the group, collect it
         }
         $0=="" { ## end of group, do something
         }
         END    { ## end of last group, do something, duplicated code
         }'
Similar reason for having "both": it ensures that I can put any special
initialization code in the group-marker case, and don't need to duplicate it in a
separate 'BEGIN{}' clause (of course, this doesn't have to be awk; it can be
perl/python/ruby/whatever does the downstream processing).
I realize it's not a "make-or-break" feature - but if we're trying to make text
processing easier, I believe "append/both" makes it even easier.
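
To make the "one point versus two points of group processing" argument concrete, here is a minimal sketch in Python (one of the downstream languages mentioned above). The function names are hypothetical, and the marker is assumed to be the empty line proposed as the default in this thread. It is only an illustration of the consumer side, not part of the patch.

```python
def groups_terminated(lines):
    """Consumer for input where EVERY group is followed by a marker line
    (the proposed "append" behaviour). Only one place handles a group."""
    group = []
    for line in lines:
        if line == "":          # marker: end of group
            yield group
            group = []
        else:                   # item in the group, collect it
            group.append(line)
    # No EOF handling here: the last group was closed by its own marker.


def groups_separated(lines):
    """Consumer for input with markers only BETWEEN groups
    (the "separate" behaviour). The last group has to be flushed at EOF."""
    group = []
    for line in lines:
        if line == "":
            yield group
            group = []
        else:
            group.append(line)
    if group:                   # second "end of group" point, at EOF
        yield group


if __name__ == "__main__":
    print(list(groups_terminated(["a", "a", "", "b", ""])))
    print(list(groups_separated(["a", "a", "", "b"])))
```

Feeding "separate"-style input to the first consumer silently drops the final group, which is the extra case the EOF handling has to cover.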
OK good arguments. Thanks.
Let's keep all apart from 'none' so.
So on to operation...
And it behaves "as expected":
===
$ printf "a\na\na\nb\nc\nc\n" | ./src/uniq --group-sep="--" --group=separate
The above isn't that useful and could be done with sed.
I assume you're specifically referring to the "group-sep" part - then OK.
Actually I was referring to the fact that in your example
--group didn't output all entries by default.
If it only output unique entries then you can separate with:
uniq | sed 'G' # (note sed also supports -z)
uniq | sed '$q;G'
So `uniq --group` should output all items by default I think.
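
As a rough model of the behaviour being proposed here (all input lines are output, with a marker delimiting runs of identical adjacent lines), the following Python sketch may help. The function name `group_lines` is hypothetical. It also assumes that "both" places a single marker between adjacent groups rather than two, which this thread does not spell out.

```python
from itertools import groupby

METHODS = ("separate", "prepend", "append", "both")


def group_lines(lines, method="separate", sep=""):
    """Return all input lines, with `sep` marking the boundaries between
    runs of identical adjacent lines. Assumed semantics:
      separate: marker between groups only
      prepend:  marker before every group
      append:   marker after every group
      both:     marker before the first, between, and after the last group
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    out = []
    first = True
    for _, run in groupby(lines):
        if first:
            if method in ("prepend", "both"):
                out.append(sep)     # marker before the first group
        else:
            out.append(sep)         # marker between two groups
        out.extend(run)             # every line of the group is kept
        first = False
    if not first and method in ("append", "both"):
        out.append(sep)             # marker after the last group
    return out


if __name__ == "__main__":
    data = ["a", "a", "a", "b", "c", "c"]
    for m in METHODS:
        print(m, group_lines(data, m, sep="--"))
```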
Supporting -u or -d with --group wouldn't be useful either really.
It's probably most consistent to just disallow those combinations.
Just to be clear on the reasoning: because with "-u" and "-d", each *line* is
implicitly a separate group, there's no apparent utility for an end-of-group marker.
Right.
I guess it's true from a technical POV - but again, for downstream analysis
convenience it's nice to have a fixed end-of-group marker.
I could use the same downstream script (which expects end-of-group markers) with uniq, whether I
used "-d" or "-u" or nothing at all.
But what's the point in such processing if there is only ever going
to be a single line in each group?
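
For reference, a small Python sketch of why -u and -d leave only one line per group: each run of identical adjacent input lines contributes at most one output line. The function name `uniq_filter` and its `mode` parameter are made up for this illustration, and the sketch ignores uniq's other options.

```python
from itertools import groupby


def uniq_filter(lines, mode):
    """Model of `uniq -u` (mode="u") and `uniq -d` (mode="d").
    -u keeps runs of length 1; -d keeps one copy of runs longer than 1.
    Either way, a run of input lines yields at most ONE output line."""
    if mode not in ("u", "d"):
        raise ValueError(f"unknown mode: {mode!r}")
    out = []
    for key, run in groupby(lines):
        count = sum(1 for _ in run)
        if (mode == "u" and count == 1) or (mode == "d" and count > 1):
            out.append(key)
    return out


if __name__ == "__main__":
    data = ["a", "a", "a", "b", "c", "c"]
    print(uniq_filter(data, "u"))
    print(uniq_filter(data, "d"))
```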
thanks,
Pádraig.