wc: expand help of '-L' (and a question)

Assaf Gordon Fri, 24 Apr 2015 19:38:47 -0700

Hello,

Would you be willing to add the following patch, which mentions the tab
expansion and multibyte counting of '-L' in the "--help" screen and the manual?
Currently this is mentioned only in one sentence at the end of a long
paragraph, and is easily missed.
My wording could be improved, but I hope this will help prevent confusion about
'wc -L' output.


Somewhat related:
I seem to get an unexpected result with '-L' when forcing the C locale.
Perhaps I'm doing something wrong, or are there more intricate details of '-L'?

# This is a Unicode Character 'BLACK HEART SUIT' (U+2665)
$ printf "\xe2\x99\xa5\n"

# With a UTF-8 locale the character count is 1,
# the byte count is 3,
# and the longest line is 1, as expected:
$ printf "\xe2\x99\xa5" | LC_ALL=en_US.UTF-8 wc -cmL
      1       3       1


# Using the C locale, characters = bytes = 3,
# but the longest line is 0?
$ printf "\xe2\x99\xa5" | LC_ALL=C wc -cmL
      3       3       0

Could this be because of wc.c line 492, where "isprint" is called on each byte
(e.g. isprint('\xe2') is false), so that these bytes are not counted at all?

thanks,
 - assaf

From 74b3d15948a86dd1aaff13529d9e7a62417e438f Mon Sep 17 00:00:00 2001
From: Assaf Gordon <[email protected]>
Date: Fri, 24 Apr 2015 22:18:41 -0400
Subject: [PATCH] wc: expand usage text of '-L' option

* src/wc.c: usage() mention tab-expansion and multibyte counting.
* doc/coreutils.texi: mention tab-expansion and multibyte counting under
  '-L' option, and provide examples.
---
 doc/coreutils.texi | 12 ++++++++++++
 src/wc.c           |  3 ++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 51d96b4..2e9d33c 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3594,6 +3594,18 @@ Print only the newline counts.
 @opindex -L
 @opindex --max-line-length
 Print only the maximum line lengths.
+Tab characters are assumed to align to every 8th position.
+Depending on the current locale, multibyte characters might be counted as
+consuming one character.
+
+For example, a 3-byte UTF-8 character is counted as one character,
+and a tab is aligned to the nearest 8th column position:
+@example
+$ printf "\xe2\x99\xa5\n" | LC_ALL=en_US.UTF-8 wc -L
+1
+$ printf "a\tb\n" | wc -L
+9
+@end example
 
 @macro filesZeroFromOption{cmd,withTotalOption,subListOutput}
 @item --files0-from=@var{file}
diff --git a/src/wc.c b/src/wc.c
index fe73d2c..5955aaf 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -129,7 +129,8 @@ the following order: newline, word, character, byte, maximum line length.\n\
       --files0-from=F    read input from the files specified by\n\
                            NUL-terminated names in file F;\n\
                            If F is - then read names from standard input\n\
-  -L, --max-line-length  print the length of the longest line\n\
+  -L, --max-line-length  print the length of the longest line in screen\n\
+                         columns (counting tabs and multi-byte characters)\n\
   -w, --words            print the word counts\n\
 "), stdout);
       fputs (HELP_OPTION_DESCRIPTION, stdout);
-- 
1.9.1
