add regexp-split: a summary and new proposal

Daniel Hartwig Fri, 30 Dec 2011 21:54:42 -0800

An attempt to summarize the pertinent points of the thread [1].

[1] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00241.html


* Semantics, generally

  `regexp-split' is similar to `string-split'.  However, the
  semantics vary between implementations on the following two
  points.  It is important to remain appropriately compatible with
  these other implementations whilst still offering the user a good
  set of functionality.

* Captured groups

  The Python [2] implementation has unique semantics whereby the
  text of any captured groups in the pattern is included in the
  result:

  >>> re.split(r'\W+', 'Words, words, words.')
  ['Words', 'words', 'words', '']
  >>> re.split(r'(\W+)', 'Words, words, words.')
  ['Words', ', ', 'words', ', ', 'words', '.', '']

  This is considered useful functionality to have [3], though not
  necessarily by default.  Consider a simple parser [4], which needs
  access to the tokens for processing.

  Other implementations such as Racket [3], Chicken [5], and Perl do not
  return the captured groups in their result.

  If there were two separate functions (or one function with an
  optional argument controlling the output) then the user could have a
  single regexp perform both the task of just splitting and the task
  of extracting the tokens. [6]
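
  As a rough illustration of the "one function with an optional
  argument" idea, here is a minimal sketch in Python.  The name
  `regexp_split' and the `keep_groups' flag are hypothetical, chosen
  only for this example; they are not taken from any implementation
  discussed in the thread.  The sketch assumes the pattern never
  matches the empty string.

```python
import re

def regexp_split(pattern, text, keep_groups=False):
    """Split text on matches of pattern.

    With keep_groups=True the text of each captured group is
    interleaved with the pieces, as Python's re.split does.
    Hypothetical helper; assumes pattern never matches "".
    """
    pieces = []
    pos = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[pos:m.start()])   # text before the delimiter
        if keep_groups:
            pieces.extend(m.groups())        # captured delimiter text
        pos = m.end()
    pieces.append(text[pos:])                # trailing piece, may be ""
    return pieces

plain = regexp_split(r'(\W+)', 'Words, words, words.')
tokens = regexp_split(r'(\W+)', 'Words, words, words.', keep_groups=True)
```

  The same regexp, groups included, serves both calls: `plain' holds
  only the pieces, while `tokens' also holds the delimiters.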

  [2] http://docs.python.org/library/re.html#re.split
  [3] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00257.html
  [4] http://80.68.89.23/2003/Oct/26/reSplit/
  [5] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00249.html
  [6] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00266.html

* Empty strings

  Some implementations (e.g. Chicken and Perl) drop (some) empty
  strings from their result.  In the case of Perl this is likely due
  to making things "nice" for the user in the majority case, but it is
  hard to revert this. [7]

  As per the example of `string-split', having empty strings in the
  result is useful to keep track of which "field" is which.

  In Scheme, if the empty strings are not desired, it is trivial to
  remove them:
   (filter (negate string-null?) lst)
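
  The same point can be shown with Python's `re.split', which also
  keeps empty strings.  The colon-separated record below is made up
  for the example.

```python
import re

# A made-up colon-separated record whose second field is empty.
record = "alice::1000:/home/alice"

fields = re.split(r":", record)
# The empty string keeps every later field at its expected index.
home = fields[3]

# Dropping the empty strings afterwards is a one-liner, but the
# positional information is then lost.
compact = [f for f in fields if f]
```

  After filtering, the home directory is at index 2 rather than 3,
  which is why dropping empty strings is better left to the caller.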

  [7] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00269.html

* Naming

  One objection raised in the thread [8]:

  > Also, to me the name seems unintuitive -- it is STR being split, not
  > RE -- perhaps this can be folded in to the existing string-split
  > function.

  [8] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00245.html


Hopefully I have not missed out anything important :-)


Anyway, what do people think of the following proposal, which tries to
address the points raised in that discussion:

* [Vanilla `string-split' expanded to support the CHAR_PRED
  semantics of `string-index' et al.]

* New function `string-explode' similar to `string-split' but returns
  the delimiters in its result.

* Regex module replaces both of these with regexp-enhanced versions.

Thus:

scheme@(guile-user)> ;; with a char predicate
scheme@(guile-user)> (string-split "123+456*/" (negate char-numeric?))
$8 = ("123" "456" "" "")
scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
$9 = ("123" "+" "456" "*" "" "/" "")
scheme@(guile-user)> ;; with a regular expression
scheme@(guile-user)> (use-modules (ice-9 regex))
scheme@(guile-user)> (define rx (make-regexp "([^0-9])"))
scheme@(guile-user)> (string-split "123+456*/" rx)
$10 = ("123" "456" "" "")
scheme@(guile-user)> ;; didn't want empty strings
scheme@(guile-user)> (filter (negate string-null?) $10)
$11 = ("123" "456")
scheme@(guile-user)> (string-explode "123+456*/" rx)
$12 = ("123" "+" "456" "*" "" "/" "")

and so on.
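
For readers who want to experiment with the proposed semantics before
any patch exists, the behaviour in the transcript above can be
modelled with a short Python sketch.  The function names mirror the
proposal, but the code is only an illustration and assumes that a
regular expression used as a delimiter never matches the empty string.

```python
import re

def _delimiter_spans(s, delim):
    """Yield (start, end) for every delimiter occurrence in s.

    delim is either a one-character predicate or a compiled regexp.
    """
    if callable(delim):
        for i, ch in enumerate(s):
            if delim(ch):
                yield i, i + 1
    else:
        for m in delim.finditer(s):
            yield m.start(), m.end()

def string_explode(s, delim):
    """Return the pieces of s interleaved with the delimiters."""
    out, pos = [], 0
    for start, end in _delimiter_spans(s, delim):
        out.append(s[pos:start])   # piece before the delimiter
        out.append(s[start:end])   # the delimiter itself
        pos = end
    out.append(s[pos:])            # trailing piece, may be ""
    return out

def string_split(s, delim):
    """Return only the pieces: every second element of the explosion."""
    return string_explode(s, delim)[0::2]

not_digit = lambda ch: not ch.isdigit()
rx = re.compile(r"([^0-9])")
```

Both `string_split("123+456*/", not_digit)' and
`string_split("123+456*/", rx)' give the same list as $8 and $10
above, and `string_explode' gives the same list as $9 and $12.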

I'm happy to throw together a patch for the above, but would like
some feedback first :-)


Regards
