Bruno Haible wrote:
> I would find it best to introduce an option '--unicode'
> to 'wc', that would produce Unicode compliant results, at the cost of
> - not following POSIX to the letter,
It'd make sense to have an option. How about a more general option, --words,
that would let the user define what a word is? This option's operand could be
a POSIX extended regular expression (ERE), or a shorthand beginning with '+'
for common combinations. For example, the command:
    wc --words='[[:alnum:]]+'
would say that a word is a maximal contiguous sequence of alphanumeric
characters. And
    wc --words='+unicode'
would use the Unicode definition of word, whatever it is.
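
To make the ERE form concrete, here is a rough sketch of how the counting
could work, using the POSIX regcomp/regexec interface. It is not taken from
the wc sources; the helper name count_words is made up for illustration, and
the sketch covers only the ERE operand, not the '+' shorthands.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of wc): count the non-overlapping,
   leftmost-longest matches of the ERE `pattern` in the NUL-terminated
   string `text`.  Returns -1 if the pattern does not compile.  */
static long
count_words (const char *pattern, const char *text)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;

  long count = 0;
  const char *p = text;
  regmatch_t m;
  int eflags = 0;

  while (*p != '\0' && regexec (&re, p, 1, &m, eflags) == 0)
    {
      if (m.rm_eo > m.rm_so)
        {
          /* Non-empty match: one word; resume right after it.  */
          count++;
          p += m.rm_eo;
        }
      else if (p[m.rm_so] == '\0')
        break;                  /* Empty match at end of text.  */
      else
        p += m.rm_so + 1;       /* Empty match: skip one byte.  */

      /* Later searches no longer start at the beginning of the text.  */
      eflags = REG_NOTBOL;
    }

  regfree (&re);
  return count;
}
```

With this sketch, count_words ("[[:alnum:]]+", "hello, world foo-bar")
returns 4, whereas the pattern '[^[:space:]]+' gives 3 for the same text.
A real implementation would presumably call setlocale (LC_ALL, "") first,
so that the bracket classes follow the user's locale, and would have to
cope with input that is not NUL-terminated.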