Re: seq feature: print letters
On 01/26/2015 03:39 PM, Pádraig Brady wrote: On 25/01/15 05:10, Assaf Gordon wrote: I'm thinking that perhaps it would be better not to include this in 'coreutils', and instead put it in another, separate project. This way, there's no worries about adding bloat to coreutils, while being more flexible in adding other features (like additional character sets from latest unicode). ... I was thinking of features like: ... That does sound like it's getting out of scope for seq

So I believe it is down to a judgement call, as to whether to include the feature in coreutils' 'seq', and keep it minimal, or move it to a separate project, and expand it over time (or both - the minimal index letters in coreutils, and more sets in a separate project?). The decision is of course yours.

As a side note, in the patch that I've sent, some of the sets of letters have been copied from the Unicode/CLDR website and data files. To the best of my understanding, this is fully compatible with GPL ( see http://www.gnu.org/licenses/license-list.html#Unicode ), but it might be necessary to include the unicode license file as an additional file. I can send an improved patch if we're proceeding with including this feature.

Regards, - Assaf
RE: seq feature: print letters
Date: Wed, 28 Jan 2015 12:32:48 -0500 From: assafgor...@gmail.com To: p...@draigbrady.com Subject: Re: seq feature: print letters CC: coreutils@gnu.org

On 01/26/2015 03:39 PM, Pádraig Brady wrote: On 25/01/15 05:10, Assaf Gordon wrote: I'm thinking that perhaps it would be better not to include this in 'coreutils', and instead put it in another, separate project. This way, there's no worries about adding bloat to coreutils, while being more flexible in adding other features (like additional character sets from latest unicode).

I think that bloat is an important issue. Systems with limited resources need to run coreutils. Would a smart watch need to print a sequence of letters to run? Adding letters creates issues with loading unicode character set tables and creeping featurism if later seq needs to implement all of the listing methods common in word processors (upper case, lower case, what happens after z, roman numerals, etc.) It breaks the unix philosophy of doing one thing and doing it well. If you need a sequence with letters, you can always use another filter to convert numbers to letters, for example,

    seq 1 10 | awk '{ printf "%c\n", ($1+64) }'

or

    seq 1 10 | perl -e 'use strict; use locale; while (<>) { printf "%s\n", chr($_ + ord("a") - 1); }'

or

    seq 1 10 | perl -e 'use strict; use Roman; while (<>) { printf "%s\n", Roman($_); }'

William
Re: seq feature: print letters
Hello William,

On Jan 28, 2015, at 19:57, William Bader williamba...@hotmail.com wrote: ... I think that bloat is an important issue. Systems with limited resources need to run coreutils. Would a smart watch need to print a sequence of letters to run? ... If you need a sequence with letters, you can always use another filter to convert numbers to letters, for example,

    seq 1 10 | awk '{ printf "%c\n", ($1+64) }'

This example works well for English, but English characters are rarely an issue, since many shells support the {A..Z} syntax. However, for almost all non-English languages there are unique and specialized sequences in the Unicode standard, such as non-sequential code points and multi-symbol letters. A visual way to appreciate the complexity is the unicode/CLDR website and its charts: http://www.unicode.org/cldr/charts/26/by_type/core_data.alphabetic_information.index.html Scrolling down to the latin languages chart section, one can see the variability in letter inclusion for each language.

Another issue is properly supporting all the environments in which coreutils can operate, including non-UTF-8 locales, and even EBCDIC (in which even English letters are not consecutive, e.g. this post from 2005: http://lists.gnu.org/archive/html/bug-coreutils/2005-04/msg00189.html ). The current suggested patch handles all those cases, at the cost of including the unicode modules from gnulib. These are the main reasons for the complexity/size of the feature.

---

This is not to say the feature is worth or not worth the added size (or bloat); I think by now it's not a technical decision, but more of a strategic one. I personally like it, but I can understand if others prefer not to include it in coreutils and put it elsewhere.

- Assaf
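The point about multi-symbol letters and non-sequential code points can be illustrated with existing tools: instead of doing arithmetic on character codes, index into an explicit list of letters. The following is only a sketch, not the patch's implementation; `letter_at` is a hypothetical helper name, and the list is a short illustrative excerpt in the spirit of the Czech "Ch" case, not a complete alphabet.

```shell
# Hypothetical helper: print the Nth "letter" of a space-separated alphabet.
# Each whitespace-separated token counts as one letter, so a multi-symbol
# letter such as "CH" is handled like any other entry.
letter_at() {
    # $1 = 1-based index, $2 = alphabet as a space-separated list
    printf '%s\n' "$2" |
        awk -v n="$1" '{ if (n >= 1 && n <= NF) print $n; else exit 1 }'
}

# Illustrative excerpt only (assumption): not a full alphabet.
excerpt="G H CH I J"

letter_at 3 "$excerpt"    # prints CH
```

Because the lookup never inspects character codes, the same approach works regardless of how the letters are encoded; the cost is that every alphabet has to be listed explicitly, which is the size trade-off discussed above.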
Re: seq feature: print letters
Hello Pádraig,

On Jan 25, 2015, at 6:13, Pádraig Brady p...@draigbrady.com wrote: On 25/01/15 05:10, Assaf Gordon wrote: ... I'm thinking that perhaps it would be better not to include this in 'coreutils', and instead put it in another, separate project. This way, there's no worries about adding bloat to coreutils, while being more flexible in adding other features (like additional character sets from latest unicode). I'm not sure. I was considering this for the release of coreutils after the imminent 8.24 one. I'm thinking V9 will start linking various utils to libunistring, and doing so in seq may not be much of a stretch.

If it's still up for inclusion in the next version, then that's great. My thoughts were that within the 'coreutils' context, every additional feature will always be evaluated as a trade-off for extra bloat. Whereas outside 'coreutils', adding more features could be easier, and bloat will be less of an issue (as in - if someone wanted these features, he/she will explicitly install the program). I was thinking of features like:

1. Adding more unicode blocks (even exotic ones, like 'runes', 'dingbats', 'braille', etc.)
2. Adding more alphabet categories (e.g. not just the indexed letters, but auxiliary letters, or upper-case/lower-case letters, or different letter glyphs for languages that have them)
3. Adding a text generator to create dummy text with a given alphabet

regards, - assaf
Re: seq feature: print letters
On 26/01/15 18:04, Assaf Gordon wrote: Hello Pádraig, On Jan 25, 2015, at 6:13, Pádraig Brady p...@draigbrady.com wrote: On 25/01/15 05:10, Assaf Gordon wrote: ... I'm thinking that perhaps it would be better not to include this in 'coreutils', and instead put it in another, separate project. This way, there's no worries about adding bloat to coreutils, while being more flexible in adding other features (like additional character sets from latest unicode). I'm not sure. I was considering this for the release of coreutils after the imminent 8.24 one. I'm thinking V9 will start linking various utils to libunistring, and doing so in seq may not be much of a stretch. If it's still up for inclusion in the next version, then that's great. My thoughts were that within the 'coreutils' context, every additional feature will always be evaluated as a trade-off for extra bloat. Whereas outside 'coreutils', adding more features could be easier, and bloat will be less of an issue (as in - if someone wanted these features, he/she will explicitly install the program). I was thinking of features like: 1. adding more unicode blocks (even exotic ones, like 'runes', 'dingbats', 'braille', etc.) 2. adding more alphabet categories (e.g. not just the indexed letters, but auxiliary letters, or upper-case/lower-case letters, or different letter glyphs for languages that have them) 3. Adding a text generator to create dummy text with a given alphabet

That does sound like it's getting out of scope for seq
Re: seq feature: print letters
On 25/01/15 05:10, Assaf Gordon wrote: Hello Pádraig and all, Regarding the seq + letters feature (originally: http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html ). I'm thinking that perhaps it would be better not to include this in 'coreutils', and instead put it in another, separate project. This way, there's no worries about adding bloat to coreutils, while being more flexible in adding other features (like additional character sets from latest unicode). WDYT? I'm not sure. I was considering this for the release of coreutils after the imminent 8.24 one. I'm thinking V9 will start linking various utils to libunistring, and doing so in seq may not be much of a stretch. I'll post a plan about that soon. thanks, Pádraig
Re: seq feature: print letters
Hello Pádraig and all, Regarding the seq + letters feature (originally: http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html ). I'm thinking that perhaps it would be better not to include this in 'coreutils', and instead put it in another, separate project. This way, there's no worries about adding bloat to coreutils, while being more flexible in adding other features (like additional character sets from latest unicode). WDYT? Regards, - assaf
Re: seq feature: print letters
Hello Pádraig,

Thanks for considering these patches.

On 10/07/2014 07:51 PM, Pádraig Brady wrote: Looks like there will be another 8.x release before 9.x opens. I intend to include this (and the sort/uniq/join field unification). Hopefully it won't be too long.

There's obviously no rush to include them, so if 8.24 is a quick bug-fix release, no need to delay it for them.

Regarding the sort/uniq/join: I think the sort part is solid: the changes are minimal (mostly extracting the code to another file). The uniq/join needs review. I think they are good, but could always use a closer look. Main concern is introducing a regression: all tests pass with this patch, so I hope there are no regressions, but I'm not sure the join/uniq tests cover all the bases.

The join/uniq need a NEWS entry, but with all the rebasing, I found NEWS to cause the most conflicts :) I removed it from the patch, but the text could be something like:

===
join accepts new options: --dictionary-order(-d), --general-numeric-sort(-g), --numeric-sort(-n), --reverse(-r) affecting key comparison. These modifiers make join more compatible with sort's --key specifications.

uniq accepts a new option: --key (-k) to determine uniqueness of lines based on key specification, similar to sort's --key specifications.
===

Regarding the alphabet: There are four parts:

1. src/alphabet_data.c - the encoded alphabet data structure
2. scripts/encode_alphabets.pl - the script to generate the above file
3. src/alphabet.{c,h} - the decoder for the alphabet data structure
4. src/seq.c - command-line argument processing to call new functions.

These might be more 'controversial' compared to existing code in coreutils, so perhaps it will take time to review and accept.
Re: seq feature: print letters
Hello, ... continuing the 'seq --alphabet' thread: http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html http://lists.gnu.org/archive/html/coreutils/2014-08/msg1.html Attached is an updated/rebased patch. Comments are welcomed, -gordon seq_letters.2014-10-07.patch.xz Description: application/xz
Re: seq feature: print letters
On 10/08/2014 12:42 AM, Assaf Gordon wrote: Hello, ... continuing the 'seq --alphabet' thread: http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html http://lists.gnu.org/archive/html/coreutils/2014-08/msg1.html Attached is an updated/rebased patch. Comments are welcomed, -gordon

Thanks Assaf. Looks like there will be another 8.x release before 9.x opens. I intend to include this (and the sort/uniq/join field unification). Hopefully it won't be too long.

cheers, Pádraig.
Re: seq feature: print letters
... continuing this thread:

On 06/30/2014 06:23 AM, assafgor...@gmail.com wrote: I'd like to suggest a patch to allow seq to generate letter sequences.

Attached is an improved patch, with documentation and an additional option, --list-alphabets, which prints all the supported language codes. To print all letters of all supported alphabets, run:

    $ for i in $(./src/seq --list-al) ; do \
        printf "%-5s = " "$i" ; \
        ./src/seq --al="$i" -s " " ; \
      done
    aa    = A B T S E C K X I D Q R F G O L M N U W H Y
    af    = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    agq   = A B C D E Ɛ F G H I Ɨ K L M N Ŋ O Ɔ P S T U Ʉ V W Y Z ʔ
    ak    = A B C D E Ɛ F G H I J K L M N O Ɔ P Q R S T U V W X Y Z
    am    = ሀ ለ ሐ መ ሠ ረ ሰ ሸ ቀ ቈ በ ቨ ተ ቸ ኀ ኈ ነ ኘ አ ከ ኰ ኸ ወ ዐ ዘ ዠ የ ደ ጀ ገ ጐ ጠ ጨ ጰ ጸ ፀ ፈ ፐ
    ar    = ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي
    as    = অ আ ই ঈ উ ঊ ঋ ৠ ঌ ৡ এ ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ৎ ত থ দ ধ ন প ফ ব ভ ম য ৰ ল ৱ শ ষ স হ ঽ
    ...

Comments are welcomed, - Assaf

seq_letters.2014-08-01.patch.xz Description: application/xz
Re: seq feature: print letters
Hello,

On 06/30/2014 06:23 AM, assafgor...@gmail.com wrote: I'd like to suggest a patch to allow seq to generate letter sequences.

Attached is an improved implementation for the same functionality: ( http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html )

With this patch, 'seq' can print letters of alphabets in the current locale (or user-specified language). Examples:

    # print all letters in the current alphabet
    seq --alphabet
    seq -a

    # print the first 10 letters in the current alphabet
    seq -a 10

    # print the letters of the Russian alphabet
    # (assuming the locale is installed)
    LC_ALL=ru_RU.utf-8 seq -a

    # print the letters of the hebrew alphabet
    # (assuming the current locale supports UTF-8 or
    # other encoding supported by gnulib/libunistring)
    seq --alphabet=he

The new data takes ~5100 bytes (instead of the previous 15KB). It requires a one-time encoding of a textual 'database' file (included) using a perl script (included). Conceptually similar to the unicode tables, this only needs to be done when an alphabet is updated. The alphabets are encoded in 'src/alphabets_data.h'. The decoder is in 'src/alphabets.{c,h}'. The added functionality is in a few new functions in 'src/seq.c'.

===

If you think that this is an acceptable feature (at least conceptually), then I'd be happy to discuss further details, such as which languages to include, and implementation suggestions (for example, should this be moved to gnulib?). Are there any important encoding issues I might have missed? (The code tries to be as portable as possible, internally storing UCS values, converting them to UTF-8 with 'u8_uctomb()', then printing them with 'u8_strconv_to_locale()' - so no assumption about the active encoding.) Should there be an interface for multi-letter output (e.g. aa after z)?

===

Regarding Bernhard's comment:

On 07/03/2014 02:18 AM, Bernhard Voelker wrote: The user could let the shell produce the input:

    $ printf %c {a..z} | seq -s ' ' --alpha=- 2 2 6
    b d f

thus picking the Nth character from the input. ;-)

I don't think this example is portable, as {a..z} is not in POSIX sh, so it can't be used in portable scripts. However, more generally, it's easy to generate ranges of unicode symbols if their values are known:

    # Arabic letters (unicode block 0x627 - 0x64a)
    seq $((0x627)) $((0x64a)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf

    # Cyrillic letters (unicode block 0x410 - 0x42f)
    seq $((0x410)) $((0x42f)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf

But the problem is that the official alphabet letters for each language are very irregular: For example, a few letters in the Arabic block aren't official ordinal letters (they are valid alphabet symbols for letters under certain conditions). Also, in some languages, a letter is actually two unicode symbols (e.g. in Czech, Ch is a single letter, in addition to the C and H letters). In non-English Latin-based languages, besides the simple ASCII letters of A-Z, there are additional symbols which are not sequential unicode values.

Whether this feature is desired or not in coreutils is one question. But if it is (for more languages than English), then I think simple ranges will not suffice.

Comments are welcomed, -gordon

seq_alphabet.2014-07-08.patch.xz Description: application/xz
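The encoding step mentioned above (a stored UCS value converted to UTF-8 before printing) can be illustrated in plain shell arithmetic. This is only a sketch of the UTF-8 encoding rule itself, not the gnulib code; `utf8_hex` is a hypothetical name, and it prints the bytes as hex rather than emitting them, so the result does not depend on the terminal or locale.

```shell
# Sketch (assumption: not the patch's code): print the UTF-8 bytes, in hex,
# that encode one Unicode code point in the range U+0000..U+10FFFF.
# Surrogate code points are not rejected here.
utf8_hex() {
    cp=$(($1))    # accepts decimal or 0x-prefixed hex
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) \
            $((0x80 | (cp & 0x3F)))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
            $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
    fi
}

utf8_hex 0x627    # first code point of the Arabic range above: d8 a7
utf8_hex 0x410    # first code point of the Cyrillic range above: d0 90
```

Storing code points and encoding only at output time keeps the data independent of the active locale; converting from UTF-8 to a non-UTF-8 locale encoding is a separate step that this sketch does not cover.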
Re: seq feature: print letters
On 07/03/2014 02:02 AM, Assaf Gordon wrote: if the user has to type them explicitly, then seq is no better than printf '%s\n' followed by all the letters typed by the user...

The user could let the shell produce the input:

    $ printf %c {a..z} | seq -s ' ' --alpha=- 2 2 6
    b d f

thus picking the Nth character from the input. ;-)

Have a nice day, Berny
Re: seq feature: print letters
On Jul 1, 2014, at 2:21, Bernhard Voelker m...@bernhard-voelker.de wrote: Hmm, what about just providing the standard A-Z alphabet, and instead leave it up to the user if she needs a different set (rolling over if needed)?

I like the idea of seq using a user-specified sequence of characters (though this brings its own issues), but my goal was to provide an easy way to generate letters in many languages - if the user has to type them explicitly, then seq is no better than printf '%s\n' followed by all the letters typed by the user... What do you think?

I'm still working on an improved patch with much more efficient storage. Hope to have it in a week or so.

Regards, - gordon
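For comparison, the user-supplied-set idea with rolling over can already be approximated with seq and awk. This is a minimal sketch under stated assumptions: `pick_from_set` is a hypothetical helper, the set is assumed to consist of single-byte (ASCII) characters without backslashes, and the input numbers are assumed to be 1 or greater.

```shell
# Hypothetical helper: map 1-based numbers on stdin to characters of a
# user-supplied set, rolling over to the start when the set is exhausted.
pick_from_set() {
    # $1 = the character set, e.g. "abc" (ASCII assumed)
    awk -v chars="$1" '
        BEGIN { n = length(chars) }
        n > 0 { print substr(chars, ($1 - 1) % n + 1, 1) }'
}

seq 1 5 | pick_from_set abc    # prints a, b, c, a, b (one per line)
```

As noted in the message above, this only helps when the user is willing to type the whole set; it does nothing to answer what the letters of a given language are.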
Re: seq feature: print letters
On 06/30/2014 11:23 AM, assafgor...@gmail.com wrote: Hello, I'd like to suggest a patch to allow seq to generate letter sequences. With this patch, 'seq' can print letters of alphabets in the current locale (or user-specified language). Examples:

    # print all letters in the current alphabet
    seq --alphabet
    seq -a

    # print the first 10 letters in the current alphabet
    seq -a 10

    # print the fifth to tenth letters of the current alphabet
    seq -a 5 10

    # print the letters of the Russian alphabet
    # (assuming the locale is installed)
    LC_ALL=ru_RU.utf-8 seq -a

    # print the letters of the hebrew alphabet
    # (assuming the current locale supports UTF-8 or
    # other encoding supported by gnulib/libunistring)
    seq --alphabet=he

More details follow: This has been suggested before, and there were several hurdles:

1. How to handle non-C locales (with letters beyond the 7-bit ASCII)
2. How to handle EBCDIC (or other standards where the letters are not sequential in their ordinal values)
3. How to handle input letters (eg seq from à to ö).

I believe the following patch can address these issues.

1. Seq in alphabet mode will deal with defined sequences of letters in each language: Instead of dealing with numeric codes of letters (e.g. from 65=A to 69=E), it deals with the first letter in language EN to the fifth letter in language EN (or any other language).

2. Unicode/CLDR already maintains a list of official letters of the alphabet for each language. Note that the list is not the same as isalpha(): it only contains the list of letters in the official alphabet. Example: In English/EN, the list contains A-Z, as expected. In French/FR, the list still contains just A-Z - those are the official letters in the French alphabet, while acute accent, grave accent and circumflex letters ( é è à â etc) are only considered diacritics, not stand-alone letters. In Swedish/SV, the list contains A-Z plus å ä ö, while à é are considered diacritics. Similar lists are maintained for each language in the Unicode database.

3. Internally, seq will store the list of letters for each language as UTF-8 - this will avoid ambiguity, and gnulib's function will provide conversion to the current encoding.

If this approach is acceptable, then we can plan for further features, such as:

4. Allow multi-character output, eg, with English, after z wrap to aa, ab, etc.

5. Allow specifying start/end with letters instead of numbers ( eg seq --abc é z), and apply collating rules to find which character in the alphabet to start from.

6. The language database (in './src/alphabet.c') is not perfect. It was automatically generated by extracting information from the Unicode/CLDR XML files. For some languages there are obvious errors (such as characters incorrectly converted from designations such as \u093C). For other languages, the code used by unicode is not necessarily compatible with the locale name. But for most languages I believe the information is valid, and for the few incorrect definitions, I think they could be easily fixed by manual inspection.

Comments are welcomed, - Gordon

I like it! The interface is concise and fits seq well. I see the jot util has similar functionality, confirming the usefulness.

I notice about 45 copies of the A-Z alphabet, would it be worth introducing aliases to avoid copies?

What about case. The current code only has upper case. case is a can of worms I know, with not necessarily 1:1 mapping etc.

The data being leveraged is well defined and at present reasonable to include directly in the seq binary (about 12K I'm guessing), though have you looked at whether libunistring contains the appropriate data/logic for this? This might be more significant if case or more characters were considered for example.

I had a quick look at the CLDR. Are you only considering the Index exemplar chars here? http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html Maybe it would be better to default to the standard exemplars? http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters

thanks! Pádraig
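The multi-character wrap proposed in item 4 (after z continue with aa, ab, ...) corresponds to what is usually called bijective base 26, the scheme spreadsheets use for column labels. A minimal sketch with awk follows; `to_letters` is a hypothetical helper, it assumes the plain a-z alphabet, and nothing here is taken from the patch.

```shell
# Hypothetical helper: convert 1-based numbers on stdin to labels
# a..z, aa, ab, ... (bijective base 26, English a-z assumed).
to_letters() {
    awk '{
        n = int($1); s = ""
        while (n > 0) {
            r = (n - 1) % 26                       # 0-based index of the last letter
            s = substr("abcdefghijklmnopqrstuvwxyz", r + 1, 1) s
            n = int((n - 1) / 26)                  # remaining higher-order part
        }
        print s
    }'
}

seq 25 28 | to_letters    # prints y, z, aa, ab (one per line)
```

The subtraction of 1 before both the modulo and the division is what distinguishes this from ordinary base 26: there is no zero digit, so 26 maps to z rather than to a two-letter label.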
Re: seq feature: print letters
On Jun 30, 2014, at 5:24, Pádraig Brady p...@draigbrady.com wrote: On 06/30/2014 11:23 AM, assafgor...@gmail.com wrote: I'd like to suggest a patch to allow seq to generate letter sequences.

I notice about 45 copies of the A-Z alphabet, would it be worth introducing aliases to avoid copies?

Yes, we can consolidate them.

What about case. The current code only has upper case. case is a can of worms I know, with not necessarily 1:1 mapping etc.

Once leaving the realm of Latin languages, upper/lower case indeed becomes very complicated. Or even meaningless. I thought that 'tr [:upper:] [:lower:]' would handle it better (but I now realize tr doesn't support UTF-8 well, if I understand correctly). I think that for the first step, we should not deal with upper/lower case issues.

The data being leveraged is well defined and at present reasonable to include directly in the seq binary (about 12K I'm guessing), though have you looked at whether libunistring contains the appropriate data/logic for this? This might be more significant if case or more characters were considered for example.

This first draft stores UTF-8 strings (with NUL) for each character. I saw the libunistring code stores some bit-fields for some of the functions, though I haven't learned it yet. I will try to improve the storage method in following patches.

I had a quick look at the CLDR. Are you only considering the Index exemplar chars here? http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html

Exactly.

Maybe it would be better to default to the standard exemplars? http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters

The reason I liked the index list is that it most directly answers the question "what is the alphabet in language X?" (as in, what are the letters that would be taught in schools as the alphabet, or that a person on the street would list if asked for the alphabet letters). It also lends itself to doing:

    # How many letters are in the Arabic alphabet:
    seq --alphabet=ar | wc -l

    # What is the eleventh letter in the Russian alphabet:
    seq --alphabet=ru | awk 'NR==11'

Technically, the functionality of is_alpha() does not correspond 1:1 to the alphabet, which is part of the problem... In English, there are no complications, but in many other languages it becomes complicated.

Using other Unicode categories (e.g. the 'main' letters or even 'auxiliary' letters) answers a slightly different question, more akin to "what symbols are acceptable in language X?" - not a bad question, just different from the previous question. For example in Hebrew, the index list contains 22 letters (which agrees with the question "how many letters are in the Hebrew alphabet"), but the main/standard list has 5 more symbols, for the 5 Hebrew letters that have a specific final form (used when those letters appear at the end of a word). So using the main list would list 5 letters twice. I believe other languages such as Arabic would present similar issues.

From a technical point of view, it's easy to include both index and standard letters (with different command-line options), it's just a matter of adding more lists.

What do you think?

Thanks, -Gordon