Re: seq feature: print letters

2015-01-28 Thread Assaf Gordon

On 01/26/2015 03:39 PM, Pádraig Brady wrote:

On 25/01/15 05:10, Assaf Gordon wrote:



I'm thinking that perhaps it would be better not to include this in 
'coreutils', and instead put it in another, separate project.
This way, there's no worries about adding bloat to coreutils, while being more 
flexible in adding other features (like additional character sets from latest 
unicode).


...


I was thinking of features like:


...


That does sound like it's getting out of scope for seq
 
So I believe it comes down to a judgement call: whether to include the feature in coreutils' 'seq' and keep it minimal,

or move it to a separate project and expand it over time (or both - the minimal 
index letters in coreutils, and more sets in a separate project?).

The decision is of course yours.


As a side note,
In the patch that I've sent, some of the sets of letters were copied from 
the Unicode/CLDR website and data files.
To the best of my understanding, this is fully compatible with the GPL (see 
http://www.gnu.org/licenses/license-list.html#Unicode ),
but it might be necessary to include the Unicode license file as an additional 
file.
I can send an improved patch if we're proceeding with including this feature.


Regards,
 - Assaf





RE: seq feature: print letters

2015-01-28 Thread William Bader
 Date: Wed, 28 Jan 2015 12:32:48 -0500
 From: assafgor...@gmail.com
 To: p...@draigbrady.com
 Subject: Re: seq feature: print letters
 CC: coreutils@gnu.org
 
 On 01/26/2015 03:39 PM, Pádraig Brady wrote:
  On 25/01/15 05:10, Assaf Gordon wrote:
 
  I'm thinking that perhaps it would be better not to include this in 
  'coreutils', and instead put it in another, separate project.
  This way, there's no worries about adding bloat to coreutils, while 
  being more flexible in adding other features (like additional character 
  sets from latest unicode).

I think that bloat is an important issue.  Systems with limited resources need 
to run coreutils.  Would a smart watch need to print a sequence of letters to 
run?
Adding letters creates issues with loading unicode character set tables and 
creeping featurism if later seq needs to implement all of the listing methods 
common in word processors (upper case, lower case, what happens after z, roman 
numerals, etc.)
It breaks the unix philosophy of doing one thing and doing it well.
If you need a sequence with letters, you can always use another filter to 
convert numbers to letters, for example,
seq 1 10 | awk '{ printf "%c\n", ($1+64) }'
or
seq 1 10 | perl -e 'use strict; use locale; while (<>) { chomp; printf "%s\n", chr($_ + ord("a") - 1); }'
or
seq 1 10 | perl -e 'use strict; use Roman; while (<>) { chomp; printf "%s\n", Roman($_); }'
William
  

Re: seq feature: print letters

2015-01-28 Thread Assaf Gordon
Hello William,

On Jan 28, 2015, at 19:57, William Bader williamba...@hotmail.com wrote:

...
 I think that bloat is an important issue.  Systems with limited resources 
 need to run coreutils.  Would a smart watch need to print a sequence of 
 letters to run?
...
 If you need a sequence with letters, you can always use another filter to 
 convert numbers to letters, for example,
 
 seq 1 10 | awk '{ printf "%c\n", ($1+64) }'

This example works well for English, but English characters are rarely an 
issue, since many shells support the {A..Z} syntax.

However, for almost all non-English languages there are unique and 
specialized sequences in the unicode standard, such as non-sequential 
code points and multi-symbol letters.

A visual way to appreciate the complexity is the unicode/CLDR website and its 
charts:
  
http://www.unicode.org/cldr/charts/26/by_type/core_data.alphabetic_information.index.html

Scrolling down to the latin languages chart section, one can see the 
variability in letter inclusion for each language. 

Another issue is properly supporting all the environments in which coreutils 
can operate, including non utf-8 locales, and even EBCDIC (in which even 
English letters are not consecutive, e.g. this post from 2005: 
http://lists.gnu.org/archive/html/bug-coreutils/2005-04/msg00189.html ). 
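
To make the EBCDIC point concrete, here is a minimal sketch (not taken from the patch) that returns the code of the Nth uppercase English letter in EBCDIC code page 037. The function name ebcdic_code is hypothetical; the three ranges are the standard positions of A-I, J-R and S-Z, and the argument is assumed to be between 1 and 26.

```shell
#!/bin/sh
# Hypothetical helper: decimal code of the Nth uppercase English letter
# (1..26) in EBCDIC code page 037. The letters sit in three separate runs.
ebcdic_code() {
    n=$1
    if [ "$n" -le 9 ]; then
        echo $(( 0xC1 + n - 1 ))     # A..I -> 0xC1..0xC9
    elif [ "$n" -le 18 ]; then
        echo $(( 0xD1 + n - 10 ))    # J..R -> 0xD1..0xD9
    else
        echo $(( 0xE2 + n - 19 ))    # S..Z -> 0xE2..0xE9
    fi
}

# I (9th letter) and J (10th letter) are not adjacent in EBCDIC.
printf 'I = 0x%X\n' "$(ebcdic_code 9)"
printf 'J = 0x%X\n' "$(ebcdic_code 10)"
```

So a filter that simply adds an offset to a number, as in the awk example, would skip into non-letter codes after "I" on such a system.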

The current suggested patch handles all those cases, at the cost of including 
the unicode modules from gnulib.

These are the main reasons for the complexity/size of the feature.

---

This is not to say the feature is worth or not worth the added size (or bloat); 
I think by now it's not a technical decision, but more of a strategic one.

I personally like it, but I can understand if others prefer not to include it 
in coreutils and put it elsewhere.

- Assaf





Re: seq feature: print letters

2015-01-26 Thread Assaf Gordon
Hello Pádraig,

On Jan 25, 2015, at 6:13, Pádraig Brady p...@draigbrady.com wrote:

 On 25/01/15 05:10, Assaf Gordon wrote:
 ...

 I'm thinking that perhaps it would be better not to include this in 
 'coreutils', and instead put it in another, separate project.
 This way, there's no worries about adding bloat to coreutils, while being 
 more flexible in adding other features (like additional character sets from 
 latest unicode).
 
 I'm not sure. I was considering this for the release of coreutils
 after the imminent 8.24 one.  I'm thinking V9 will start linking
 various utils to libunistring, and doing so in seq may not be much
 of a stretch.
 

If it's still up for inclusion in the next version, then that's great.

My thoughts were that within the 'coreutils' context, every additional feature 
will always be evaluated as a trade-off for extra bloat.
Whereas outside 'coreutils', adding more features could be easier, and bloat 
will be less of an issue (as in - if someone wanted these features, he/she will 
explicitly install the program).

I was thinking of features like:
1. adding more unicode blocks (even exotic ones, like 'runes', 'dingbats', 
'braille', etc.)
2. adding more alphabet categories (e.g. not just the indexed letters, but 
auxiliary letters, or upper-case/lower-case letters, or different letter glyphs 
for languages that have them)
3. Adding a text generator to create dummy text with a given alphabet

regards,
 - assaf


Re: seq feature: print letters

2015-01-26 Thread Pádraig Brady
On 26/01/15 18:04, Assaf Gordon wrote:
 Hello Pádraig,
 
 On Jan 25, 2015, at 6:13, Pádraig Brady p...@draigbrady.com wrote:
 
 On 25/01/15 05:10, Assaf Gordon wrote:
 ...
 
 I'm thinking that perhaps it would be better not to include this in 
 'coreutils', and instead put it in another, separate project.
 This way, there's no worries about adding bloat to coreutils, while being 
 more flexible in adding other features (like additional character sets from 
 latest unicode).

 I'm not sure. I was considering this for the release of coreutils
 after the imminent 8.24 one.  I'm thinking V9 will start linking
 various utils to libunistring, and doing so in seq may not be much
 of a stretch.

 
 If it's still up for inclusion in the next version, then that's great.
 
 My thoughts were that within the 'coreutils' context, every additional 
 feature will always be evaluated as a trade-off for extra bloat.
 Whereas outside 'coreutils', adding more features could be easier, and bloat 
 will be less of an issue (as in - if someone wanted these features, he/she 
 will explicitly install the program).
 
 I was thinking of features like:
 1. adding more unicode blocks (even exotic ones, like 'runes', 'dingbats', 
 'braille', etc.)
 2. adding more alphabet categories (e.g. not just the indexed letters, but 
 auxiliary letters, or upper-case/lower-case letters, or different letter 
 glyphs for languages that have them)
 3. Adding a text generator to create dummy text with a given alphabet

That does sound like it's getting out of scope for seq




Re: seq feature: print letters

2015-01-25 Thread Pádraig Brady
On 25/01/15 05:10, Assaf Gordon wrote:
 Hello Pádraig and all,
 
 Regarding the seq + letters feature
 (originally: 
 http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html ).
 
 I'm thinking that perhaps it would be better not to include this in 
 'coreutils', and instead put it in another, separate project.
 This way, there's no worries about adding bloat to coreutils, while being 
 more flexible in adding other features (like additional character sets from 
 latest unicode).
 
 WDYT?

I'm not sure. I was considering this for the release of coreutils
after the imminent 8.24 one.  I'm thinking V9 will start linking
various utils to libunistring, and doing so in seq may not be much
of a stretch.

I'll post a plan about that soon.

thanks,
Pádraig




Re: seq feature: print letters

2015-01-24 Thread Assaf Gordon
Hello Pádraig and all,

Regarding the seq + letters feature
(originally: http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html 
).

I'm thinking that perhaps it would be better not to include this in 
'coreutils', and instead put it in another, separate project.
This way, there's no worries about adding bloat to coreutils, while being more 
flexible in adding other features (like additional character sets from latest 
unicode).

WDYT?

Regards,
 - assaf




Re: seq feature: print letters

2014-10-08 Thread Assaf Gordon

Hello Pádraig,

Thanks for considering these patches.

On 10/07/2014 07:51 PM, Pádraig Brady wrote:

Looks like there will be another 8.x release before 9.x opens.
I intend to include this (and the sort/uniq/join field unification).
Hopefully it won't be too long.


There's obviously no rush to include them, so if 8.24 is a quick bug-fix 
release, no need to delay it for them.

Regarding the sort/uniq/join:
I think the sort part is solid: the changes are minimal (mostly extracting 
the code to another file).
The uniq/join parts need review. I think they are good, but could always use a 
closer look.
My main concern is introducing a regression: all tests pass with this patch, so I 
hope there are no regressions, but
I'm not sure the join/uniq tests cover all the bases.

The join/uniq need a NEWS entry, but with all the rebasing, I found NEWS to 
cause the most conflicts :)
I removed it from the patch, but the text could be something like:
===
  join accepts new options: --dictionary-order(-d), --general-numeric-sort(-g),
  --numeric-sort(-n), --reverse(-r) affecting key comparison. These modifiers
  make join more compatible with sort's --key specifications.

  uniq accepts a new option: --key (-k) to determine uniqueness of lines based
  on key specification, similar to sort's --key specifications.
===


Regarding the alphabet:
There are four parts:
1. src/alphabet_data.c - the encoded alphabet data structure
2. scripts/encode_alphabets.pl - the script to generate the above file
3. src/alphabet.{c,h} - the decoder for the alphabet data structure
4. src/seq.c - command-line argument processing to call new functions.
These might be more 'controversial' compared to existing code in coreutils, so 
perhaps it will take time to review and accept.





Re: seq feature: print letters

2014-10-07 Thread Assaf Gordon

Hello,

... continuing the 'seq --alphabet' thread:

  http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html
  http://lists.gnu.org/archive/html/coreutils/2014-08/msg1.html

Attached is an updated/rebased patch.

Comments are welcomed,
 -gordon


seq_letters.2014-10-07.patch.xz
Description: application/xz


Re: seq feature: print letters

2014-10-07 Thread Pádraig Brady
On 10/08/2014 12:42 AM, Assaf Gordon wrote:
 Hello,
 
 ... continuing the 'seq --alphabet' thread:
 
   http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html
   http://lists.gnu.org/archive/html/coreutils/2014-08/msg1.html
 
 Attached is an updated/rebased patch.
 
 Comments are welcomed,
  -gordon

Thanks Assaf.

Looks like there will be another 8.x release before 9.x opens.
I intend to include this (and the sort/uniq/join field unification).
Hopefully it won't be too long.

cheers,
Pádraig.



Re: seq feature: print letters

2014-08-01 Thread Assaf Gordon

... continuing this thread:


On 06/30/2014 06:23 AM, assafgor...@gmail.com wrote:

I'd like to suggest a patch to allow seq to generate letter sequences.


Attached is an improved patch, with documentation and additional option of 
--list-alphabets - printing all the supported language codes.

To print all letters of all supported alphabets, run:

$ for i in $(./src/seq --list-al) ; do \
    printf "%-5s = " "$i" ; \
    ./src/seq --al="$i" -s " " ; \
  done
aa    = A B T S E C K X I D Q R F G O L M N U W H Y
af    = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
agq   = A B C D E Ɛ F G H I Ɨ K L M N Ŋ O Ɔ P S T U Ʉ V W Y Z ʔ
ak    = A B C D E Ɛ F G H I J K L M N O Ɔ P Q R S T U V W X Y Z
am    = ሀ ለ ሐ መ ሠ ረ ሰ ሸ ቀ ቈ በ ቨ ተ ቸ ኀ ኈ ነ ኘ አ ከ ኰ ኸ ወ ዐ ዘ ዠ የ ደ ጀ ገ ጐ ጠ ጨ ጰ ጸ ፀ ፈ ፐ
ar    = ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي
as    = অ আ ই ঈ উ ঊ ঋ ৠ ঌ ৡ এ ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ৎ ত থ দ ধ ন প ফ ব ভ ম য ৰ ল ৱ শ ষ স হ ঽ
...


Comments are welcomed,
 - Assaf


seq_letters.2014-08-01.patch.xz
Description: application/xz


Re: seq feature: print letters

2014-07-08 Thread Assaf Gordon

Hello,

On 06/30/2014 06:23 AM, assafgor...@gmail.com wrote:

I'd like to suggest a patch to allow seq to generate letter sequences.


Attached is an improved implementation for the same functionality:
( http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html )


With this patch, 'seq' can print letters of alphabets in the current locale
(or user-specified language). Examples:

 # print all letters in the current alphabet
 seq --alphabet
 seq -a
 # print the first 10 letters in the current alphabet
 seq -a 10
 # print the letters of the Russian alphabet
 # (assuming the locale is installed)
 LC_ALL=ru_RU.utf-8 seq -a
 # print the letters of the hebrew alphabet
 # (assuming the current locale supports UTF-8 or
 #  other encoding supported by gnulib/libunistring)
 seq --alphabet=he



The new data takes ~5100 bytes (instead of previous 15KB).

It requires (one time) encoding of a 'database' textual file (included) using a 
perl script (included).
Conceptually similar to the unicode tables, this only needs to be done when an 
alphabet is updated.

The alphabets are encoded in 'src/alphabets_data.h'.
The decoder is in 'src/alphabets.{c,h}' .
The added functionality is in a few new functions in 'src/seq.c' .

===

If you think that this is an acceptable feature (at least conceptually), then 
I'd be happy to discuss further details,
such as which languages to include, and implementation suggestions (for 
example, should this be moved to gnulib?).

Are there any important encoding issues I might have missed? (The code tries to 
be as portable as possible, internally storing UCS values, converting them to 
UTF-8 with 'u8_uctomb()', then printing them with 'u8_strconv_to_locale()' - so 
there is no assumption about the active encoding.)

Should there be an interface for multi-letter output (e.g. "aa" after "z")?
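
One possible reading of "aa after z" is spreadsheet-style column labels (bijective base 26). The sketch below shows that interpretation for the plain A-Z alphabet only; the function name letters_label is hypothetical and nothing here is part of the patch.

```shell
#!/bin/sh
# Hypothetical helper: map a positive integer to a spreadsheet-style label,
# so 1..26 give A..Z, 27 gives AA, 28 gives AB, and so on.
letters_label() {
    n=$1
    label=
    alphabet=ABCDEFGHIJKLMNOPQRSTUVWXYZ
    while [ "$n" -gt 0 ]; do
        # Bijective base 26: shift to 0..25 before taking the remainder.
        r=$(( (n - 1) % 26 ))
        # Pick the single character at position r+1 of the alphabet string.
        ch=$(printf '%s\n' "$alphabet" | cut -c "$(( r + 1 ))")
        label=$ch$label
        n=$(( (n - 1) / 26 ))
    done
    printf '%s\n' "$label"
}

# Prints Y, Z, AA, AB (one per line).
for i in 25 26 27 28; do
    letters_label "$i"
done
```

For alphabets with multi-symbol letters the same arithmetic would apply, but the lookup would have to index a list of letters rather than a string of single characters.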

===

Regarding Bernhard's comment:

On 07/03/2014 02:18 AM, Bernhard Voelker wrote:

The user could let the shell produce the input:
   $ printf %c {a..z} | seq -s ' ' --alpha=- 2 2 6
   b d f
thus picking the Nth character from the input. ;-)


I don't think this example is portable, as {a..z} is not in POSIX sh, so 
can't be used in scripting.

However, more generally, it's easy to generate ranges of unicode symbols if 
their value is known:

# Arabic letters (unicode block 0x627 - 0x64a)
# (backslashes are doubled twice: once for printf, once for xargs)
seq $((0x627)) $((0x64a)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf

# Cyrillic letters (unicode block 0x410 - 0x42f)
seq $((0x410)) $((0x42f)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf

But the problem is that the official alphabet letters for each language are very 
irregular:
For example, a few letters in the Arabic block aren't official ordinal letters 
(they are valid alphabet symbols that stand in for a letter under certain 
conditions).
Also, in some languages, a letter is actually two unicode symbols (e.g. in Czech, "Ch" is a single 
letter, in addition to the "C" and "H" letters).
In non-English Latin-based languages, besides the simple ASCII letters of A-Z, 
there are additional symbols which are not sequential unicode values.

Whether this feature is desired or not in coreutils is one question. But if it is (for 
more languages than English), then I think simple ranges will not suffice.
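
To illustrate why a table is needed rather than a range, here is a minimal sketch of a per-language lookup in which an entry may be more than one symbol. The function name nth_letter is hypothetical, and the two lists are typed in by hand for illustration (the Czech list is my assumption of what an index list looks like); they are not copied from the patch's data files.

```shell
#!/bin/sh
# Hypothetical helper: print the Nth entry of a language's letter list.
# The lists are hand-typed for illustration only, not the patch's data.
nth_letter() {
    lang=$1
    n=$2
    case $lang in
        en) set -- A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ;;
        cs) set -- A B C Č D E F G H CH I J K L M N O P Q R Ř S Š T U V W X Y Z Ž ;;
        *)  echo "unknown language: $lang" >&2; return 1 ;;
    esac
    # After 'set --', $# is the number of letters in the chosen list.
    if [ "$n" -lt 1 ] || [ "$n" -gt "$#" ]; then
        echo "index out of range: $n" >&2
        return 1
    fi
    shift $(( n - 1 ))
    printf '%s\n' "$1"
}

nth_letter en 10    # J
nth_letter cs 10    # CH: a single entry made of two symbols
```

The index is a position in the list, so no assumption is made about the numeric values of the letters or about one entry being one code point.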


Comments are welcomed,
 -gordon
















seq_alphabet.2014-07-08.patch.xz
Description: application/xz


Re: seq feature: print letters

2014-07-03 Thread Bernhard Voelker
On 07/03/2014 02:02 AM, Assaf Gordon wrote:
 if the user has to type them explicitly, then seq is no better
 than printf '%s\n'  followed by all the letters typed by the user...

The user could let the shell produce the input:

  $ printf %c {a..z} | seq -s ' ' --alpha=- 2 2 6
  b d f

thus picking the Nth character from the input. ;-)

Have a nice day,
Berny



Re: seq feature: print letters

2014-07-02 Thread Assaf Gordon

 On Jul 1, 2014, at 2:21, Bernhard Voelker m...@bernhard-voelker.de wrote:
 
 Hmm, what about just providing the standard A-Z alphabet,
 and instead leave it up to the user if she needs a different
 set (rolling over if needed)?

I like the idea of seq using a user-specified sequence of characters (though this 
brings its own issues),
but my goal was to provide an easy way to generate letters in many languages - 
if the user has to type them explicitly, then seq is no better than printf 
'%s\n'  followed by all the letters typed by the user...

What do you think?

I'm still working on an improved patch with much more efficient storage. Hope 
to have it in a week or so.

Regards,
  - gordon


Re: seq feature: print letters

2014-06-30 Thread Pádraig Brady
On 06/30/2014 11:23 AM, assafgor...@gmail.com wrote:
 Hello,
 
 I'd like to suggest a patch to allow seq to generate letter sequences.
 
 With this patch, 'seq' can print letters of alphabets in the current locale
 (or user-specified language). Examples:
 
 # print all letters in the current alphabet
 seq --alphabet
 seq -a
 # print the first 10 letters in the current alphabet
 seq -a 10
 # print the fifth to tenth letters of the current alphabet
 seq -a 5 10
 # print the letters of the Russian alphabet
 # (assuming the locale is installed)
 LC_ALL=ru_RU.utf-8 seq -a
 # print the letters of the hebrew alphabet
 # (assuming the current locale supports UTF-8 or
 #  other encoding supported by gnulib/libunistring)
 seq --alphabet=he
 
 
 More details follow:
 
 This has been suggested before, and there were several hurdles:
 1. How to handle non C locales (with letters beyond the 7-bit ASCII)
 2. How to handle EBCDIC (or other standards where the letters are not 
 sequential in their ordinal values)
 3. How to handle input letters (eg seq from à to ö).
 
 I believe the following patch can address these issues.
 
 1. Seq in alphabet mode will deal with defined sequences of letters in each 
 language:
 Instead of dealing with numeric codes of letters (e.g. from 65=A to 69=E),
 it deals with "first letter in language EN" to "fifth letter in language EN" (or 
 any other language).
 
 2. Unicode/CLDR already maintains a list of official letters of the alphabet 
 for each language. Note that the list is not the same as isalpha(): it only 
 contains the list of letters in the official alphabet.
 
 Example:
 In English/EN, the list contains A-Z, as expected.
 In French/FR, the list still contains just A-Z - those are the official 
 letters in the French alphabet, while acute accent, grave accent and 
 circumflex letters ( é è à â etc) are only considered diacritics, not 
 stand-alone letters.
 In Swedish/SV, the list contains A-Z plus å ä ö, while à é are considered 
 diacritics.
 Similar lists are maintained for each language in the Unicode database.
 
 3. Internally, seq will store the list of letters for each language as UTF-8 
 - this will avoid ambiguity, and gnulib's function will provide conversion to 
 the current encoding.
 
 If this approach is acceptable, then we can plan for further features, such 
 as:
 
 4. Allow multi-character output, eg, with English, after z wrap to aa, 
 ab, etc.
 
 5. Allow specifying start/end with letters instead of numbers (eg seq 
 --abc é z), and apply collating rules to find which character in the 
 alphabet to start from.
 
 6. The language database (in './src/alphabet.c') is not perfect. It was 
 automatically generated by extracting information from the Unicode/CLDR XML 
 files. For some languages there are obvious errors (such as characters 
 incorrectly converted from designations such as \u093C). For other languages, 
 the code used by unicode is not necessarily compatible with the locale name. 
 But for most languages I believe the information is valid, and for the few 
 incorrect definitions, I think they could be easily fixed by manual 
 inspection.
 
 Comments are welcomed,
   - Gordon

I like it!
The interface is concise and fits seq well.
I see the jot util has similar functionality confirming the usefulness.
I notice about 45 copies of the A-Z alphabet, would it be worth introducing 
aliases to avoid copies?
What about case? The current code only has upper case. Case is a can of worms, I 
know, with not necessarily 1:1 mapping etc.
The data being leveraged is well defined and at present reasonable to include 
directly in the seq binary (about 12K I'm guessing),
though have you looked at whether libunistring contains the appropriate 
data/logic for this?
This might be more significant if case or more characters were considered for 
example.
I had a quick look at the CLDR. Are you only considering the Index exemplar 
chars here?
  
http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html
Maybe it would be better to default to the standard exemplars?
  http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters

thanks!
Pádraig




Re: seq feature: print letters

2014-06-30 Thread Assaf Gordon

 On Jun 30, 2014, at 5:24, Pádraig Brady p...@draigbrady.com wrote:
 
 On 06/30/2014 11:23 AM, assafgor...@gmail.com wrote:
 I'd like to suggest a patch to allow seq to generate letter sequences.
 I notice about 45 copies of the A-Z alphabet, would it be worth introducing 
 aliases to avoid copies?

Yes, we can consolidate them.

 What about case. The current code only has upper case. case is a can of worms 
 I know, with not necessarily 1:1 mapping etc.

Once leaving the realm of latin languages, upper/lower case indeed becomes very 
complicated. Or even meaningless. I thought that 'tr [:upper:] [:lower:]' would 
handle it better (but I now realize tr doesn't support UTF-8 well, if I 
understand correctly).

I think that for the first step, we should not deal with upper/lower case 
issues.

 The data being leveraged is well defined and at present reasonable to include 
 directly in the seq binary (about 12K I'm guessing),
 though have you looked at whether libunistring contains the appropriate 
 data/logic for this?
 This might be more significant if case or more characters were considered for 
 example.

This first draft stores UTF-8 strings (with NUL) for each character.  I saw the 
libunistring code stores some bit-fields for some of the functions, though I 
haven't learned it yet.
I will try to improve the storage method in following patches.

 I had a quick look at the CLDR. Are you only considering the Index exemplar 
 chars here?
 http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html

Exactly.

 Maybe it would be better to default to the standard exemplars?
 http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters

The reason I like the index list is that it most directly answers the 
question "what is the alphabet in language X?" (as in, what are the letters 
that would be taught in schools as "the alphabet", or what you would get if you 
asked a person on the street to list the alphabet letters).
It also lends itself to things like:
   # How many letters are in the Arabic alphabet:
   seq --alphabet=ar | wc -l
   # What is the eleventh letter in the Russian alphabet:
   seq --alphabet=ru | awk 'NR==11'

Technically, the functionality of isalpha() does not correspond 1:1 to the 
alphabet, which is part of the problem... In English, there are no 
complications, but in many other languages it becomes complicated.

Using other Unicode categories (e.g. the 'main' letters or even 'auxiliary' 
letters) answers a slightly different question, more akin to "what symbols are 
acceptable in language X?" - not a bad question, just different from the 
previous question.

For example, in Hebrew the index list contains 22 letters (which agrees with 
the answer to "how many letters are in the Hebrew alphabet?"), but the 
main/standard list has 5 more symbols: 5 Hebrew letters have a specific 
final form (used when those letters appear at the end of a word).
So using the main list would list 5 letters twice. I believe other languages 
such as Arabic would present similar issues.
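
The Hebrew numbers can be checked with plain arithmetic on code points. The sketch below (an illustration, not part of the patch) walks the range U+05D0..U+05EA and skips the five final forms; it only counts and does not depend on locale or on printing the letters.

```shell
#!/bin/sh
# Count the code points in the Hebrew letter range U+05D0..U+05EA,
# with and without the five final-form letters.
total=0
index=0
cp=$((0x5D0))
while [ "$cp" -le $((0x5EA)) ]; do
    total=$(( total + 1 ))
    case $(printf '%04X' "$cp") in
        05DA|05DD|05DF|05E3|05E5) ;;      # final kaf, mem, nun, pe, tsadi
        *) index=$(( index + 1 )) ;;
    esac
    cp=$(( cp + 1 ))
done
echo "code points in range: $total"
echo "without final forms:  $index"
```

This prints 27 and 22, which is the difference between a plain range (or the main list) and the index list described above.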

From a technical point of view, it's easy to include both index and 
standard letters (with different command-line options), it's just a matter of 
adding more lists.

What do you think?

Thanks,
 -Gordon