An attempt to summarize the pertinent points of the thread [1].

[1] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00241.html

* Semantics, generally

`regexp-split' is similar to `string-split'. However, the semantics
vary between implementations on the following two points. It is
important to stay reasonably compatible with these other
implementations whilst still offering the user a good set of
functionality.

* Captured groups

The Python [2] implementation has unique semantics whereby the text of
any captured groups in the pattern is included in the result:

    >>> re.split('\W+', 'Words, words, words.')
    ['Words', 'words', 'words', '']
    >>> re.split('(\W+)', 'Words, words, words.')
    ['Words', ', ', 'words', ', ', 'words', '.', '']

This is considered useful functionality to have [3], though not
necessarily by default. Consider a simple parser [4], which needs
access to the tokens for processing.

Other implementations such as Racket [3], Chicken [5], and Perl do not
return the captured groups in their result.

If there were two separate functions (or one function with an optional
argument controlling the output), then the user could have a single
regexp perform both the task of just splitting and the task of
extracting the tokens. [6]

[2] http://docs.python.org/library/re.html#re.split
[3] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00257.html
[4] http://80.68.89.23/2003/Oct/26/reSplit/
[5] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00249.html
[6] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00266.html

* Empty strings

Some implementations (e.g. Chicken and Perl) drop (some) empty strings
from their result. In the case of Perl this is likely meant to make
things "nice" for the user in the majority case, but it is hard for the
user to undo. [7]

As per the example of `string-split', having empty strings in the
result is useful for keeping track of which "field" is which.

In Scheme, if the empty strings are not desired, it is trivial to
remove them:

    (filter (negate string-null?) lst)

[7] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00269.html

* Naming

> Also, to me the name seems unintuitive -- it is STR being split, not
> RE -- perhaps this can be folded in to the existing string-split
> function.

[8] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00245.html

Hopefully I have not missed out anything important :-)

Anyway, what do people think of this proposal, which tries to address
that whole discussion:

* [Vanilla `string-split' expanded to support the CHAR_PRED semantics
  of `string-index' et al.]

* New function `string-explode', similar to `string-split' but
  returning the delimiters in its result.

* Regex module replaces both of these with regexp-enhanced versions.

Thus:

    scheme@(guile-user)> ;; with a char predicate
    scheme@(guile-user)> (string-split "123+456*/" (negate char-numeric?))
    $8 = ("123" "456" "" "")
    scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
    $9 = ("123" "+" "456" "*" "" "/" "")
    scheme@(guile-user)> ;; with a regular expression
    scheme@(guile-user)> (use-modules (ice-9 regex))
    scheme@(guile-user)> (define rx (make-regexp "([^0-9])"))
    scheme@(guile-user)> (string-split "123+456*/" rx)
    $10 = ("123" "456" "" "")
    scheme@(guile-user)> ;; didn't want empty strings
    scheme@(guile-user)> (filter (negate string-null?) $10)
    $11 = ("123" "456")
    scheme@(guile-user)> (string-explode "123+456*/" rx)
    $12 = ("123" "+" "456" "*" "" "/" "")

and so on.

I'm happy to throw together a patch for the above, but would like some
feedback first :-)

Regards
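
P.S. As a cross-check of the expected results in the transcript, the
same input can be run through Python's re.split [2], which already has
both behaviours depending on whether the pattern contains a capturing
group. This is only an illustrative sketch. The variable names are made
up for the example and none of it is Guile code. It merely suggests
that the proposed `string-split' / `string-explode' results line up
with Python's.

```python
import re

text = "123+456*/"

# No capturing group: delimiters are discarded, which corresponds to
# the proposed regexp-enabled `string-split' (compare $10).
split_only = re.split(r"[^0-9]", text)

# Capturing group: delimiters are kept in the result, which corresponds
# to the proposed `string-explode' (compare $12).
exploded = re.split(r"([^0-9])", text)

# Empty strings are kept by default and can be filtered out afterwards,
# like (filter (negate string-null?) lst) in Scheme (compare $11).
non_empty = [s for s in split_only if s]

print(split_only)  # ['123', '456', '', '']
print(exploded)    # ['123', '+', '456', '*', '', '/', '']
print(non_empty)   # ['123', '456']
```

Note the empty string between "*" and "/" and the one after the
trailing "/": keeping them is what lets a caller tell which field is
which.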