Greets,

Most languages provide support for extracting substrings from strings via
methods, builtin functions, or slicing operators.  For the Clownfish
String class, a method-based implementation such as we have now is most
suitable, but it's worth reviewing the method's signature and behavior.

Python supports substrings via a slice operator.  The arguments are both
indexes, and negative numbers count backwards from the end.  Arguments out of
range are clamped to lie between 0 and the length of the string.

    abcd  = "abcd"
    c     = abcd[2:3]
    c     = abcd[-2:-1]
    ab    = abcd[:2]
    cd    = abcd[2:]
    cd    = abcd[2:10]
    empty = abcd[2:-10]

Ruby's String class supports substrings via a subscripting/slicing operator
which is syntactic sugar for the String#[] method.  It takes a wide variety of
arguments, including ranges and regexes.  The same behaviors are also
available via String#slice.

We'll need to offer a single signature (because we don't support overloading
by signature), and it will need to accept integer primitives rather than
objects in order to be useful from C, so we can ignore the more fancy Ruby
behaviors.  Nevertheless, it's useful to note the following:

*   Two integers are interpreted as `start` and `length`.
*   Negative start values count backwards from the end of the string.
*   An out-of-range start value before the beginning of the string or beyond
    its end returns `nil`, even if part of the string would lie within the
    specified bounds.
*   Out-of-range lengths beyond the end of the string are clamped.
*   Supplying a negative length returns `nil`.

    abcd  = "abcd"
    c     = abcd[2, 1]
    c     = abcd[-2, 1]
    cd    = abcd[2, 2]
    cd    = abcd[2, 10]
    _nil  = abcd[2, -1]
    empty = abcd[0, 0]
    empty = abcd[-4, 0]
    _nil  = abcd[-5, 0]
    _nil  = abcd[-5, 4]

Perl's `substr` operator is complex: it can even be used as an lvalue in an
assignment statement to mutate string content.  We're only concerned with the
two arguments which follow the string itself: OFFSET, and an optional LENGTH.
A negative OFFSET counts backwards from the end of the string, and a negative
LENGTH leaves that many characters off the end.

If the substring lies entirely outside the string, `substr` warns and returns
undef.

    $abcd  = "abcd";
    $c     = substr($abcd, 2, 1);
    $c     = substr($abcd, -2, -1);
    $cd    = substr($abcd, 2);
    $empty = substr($abcd, 0, 0);
    $empty = substr($abcd, -4, 0);
    $undef = substr($abcd, -5, 0);

Java provides two methods named `substring`, one which takes a single
`beginIndex`, and another which takes both `beginIndex` and `endIndex`.
Arguments out of bounds trigger exceptions.

    String abcd  = "abcd";
    String cd    = abcd.substring(2);
    String c     = abcd.substring(2, 3);
    abcd.substring(-1);       // error
    abcd.substring(-1, 0);    // error

C# provides two `Substring` methods, one which takes a single
`int startIndex`, and one which takes both `int startIndex` and `int length`.
Arguments out of bounds trigger exceptions.

    string abcd = "abcd";
    string cd   = abcd.Substring(2);
    string c    = abcd.Substring(2, 1);

...

I propose the following API:

    /** Extracts a substring.
     *
     * If the specified substring lies partially or completely outside the
     * boundaries of the string, only the portion within the string will be
     * returned.  Supplying a negative value for `length` returns an empty
     * string.
     *
     * @param offset Start offset from the top of the string, in code points.
     * @param length The length of the substring, in code points.
     */
    public incremented String*
    Substring(String *self, int64_t offset = 0,
              int64_t length = 0x7fffffffffffffff);

(The default value for length is equal to INT64_MAX.  The Clownfish header
language currently supports hex literals but not symbolic constants.)
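
To make the clamping rules concrete, here is a minimal sketch in plain C of how
an implementation might compute the range to extract.  The names `cp_range` and
`clamp_substring` are hypothetical and not part of Clownfish.  The sketch also
assumes one particular reading of the doc comment: a negative `offset` denotes
a position before the start of the string, so the part of the requested range
which falls before index 0 is simply discarded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: a clamped range measured in code points. */
typedef struct {
    int64_t start;   /* index of the first code point in the result */
    int64_t count;   /* number of code points in the result */
} cp_range;

/* Hypothetical helper illustrating the proposed clamping rules.
 * `str_len` is the length of the string in code points. */
cp_range
clamp_substring(int64_t str_len, int64_t offset, int64_t length) {
    cp_range empty = {0, 0};

    /* A non-positive length, or a start at or past the end, yields an
     * empty result. */
    if (str_len <= 0 || length <= 0 || offset >= str_len) {
        return empty;
    }

    if (offset < 0) {
        /* ASSUMPTION: the range starts before the string, so drop the
         * part before index 0.  This cannot overflow because length > 0
         * and offset < 0. */
        length += offset;
        if (length <= 0) {
            return empty;
        }
        offset = 0;
    }

    /* Compare against what remains instead of computing offset + length,
     * which would overflow with the INT64_MAX default. */
    int64_t remaining = str_len - offset;
    if (length > remaining) {
        length = remaining;
    }

    cp_range result = {offset, length};
    assert(result.start >= 0 && result.count <= str_len - result.start);
    return result;
}
```

Under these assumptions, `clamp_substring(4, 2, 10)` selects the last two code
points of "abcd", while `clamp_substring(4, 2, -1)` and
`clamp_substring(4, -5, 4)` both select nothing.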

Changes from the old SubString method:

*   Change of capitalization from "SubString" to "Substring".
*   Parameter "len" renamed to "length".
*   Parameters are now int64_t rather than size_t.
*   Parameters now have defaults.

Rationales:

We want to emphasize the semantics of a String as a sequence of code points
and avoid tying parameters to a specific encoding.  "Substring", "offset" and
"length" do not imply random access as would "Slice", "start_index", and
"end_index".
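
The point about random access can be illustrated with a rough sketch.  Assuming
the string's content is stored as valid UTF-8 (an assumption made for
illustration, along with the hypothetical names `byte_range`,
`skip_code_points` and `substring_bytes`), locating a substring means walking
forward over code points rather than jumping to an index:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: a half-open byte range [start, end). */
typedef struct {
    size_t start;
    size_t end;
} byte_range;

/* Advance past up to `count` code points of valid UTF-8, beginning at byte
 * position `pos`.  Returns the new byte position, never more than `size`. */
size_t
skip_code_points(const char *utf8, size_t size, size_t pos, int64_t count) {
    while (count > 0 && pos < size) {
        pos++;  /* lead byte */
        /* Continuation bytes have the bit pattern 10xxxxxx. */
        while (pos < size && ((unsigned char)utf8[pos] & 0xC0) == 0x80) {
            pos++;
        }
        count--;
    }
    return pos;
}

/* Translate an offset and length in code points into a byte range.
 * ASSUMPTION: negative arguments have already been dealt with by the
 * caller, so both values are non-negative here. */
byte_range
substring_bytes(const char *utf8, size_t size, int64_t offset,
                int64_t length) {
    assert(offset >= 0 && length >= 0);
    byte_range range;
    range.start = skip_code_points(utf8, size, 0, offset);
    range.end   = skip_code_points(utf8, size, range.start, length);
    return range;
}
```

Both steps are linear scans which stop at the end of the buffer, so oversized
values for `offset` and `length` are harmless and the cost depends on how far
into the string the substring lies.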

By returning an empty string when provided with out-of-bounds values for
`offset` and `length`, we guarantee that the method never throws exceptions
and always returns a valid String object.  That makes it easier to use in
complex expression idioms without the need for sophisticated error handling.

We opt for simple handling of negative values because giving them special
meaning, such as counting backwards from the end of the string, is a
convenience of dubious value which makes code that invokes Substring() harder
to grok.

Marvin Humphrey
