Re: RFC: add a string-desc module

2023-03-28 Thread Bruno Haible
Simon Josefsson wrote:
> I think this is a useful contribution,

Thanks.

> however I see two deal-breakers
> for having it in gnulib -- both related to use in libraries.  I think
> string helper types/functions like this are useful not only in
> applications but also in libraries.  Thus:
> 
>  1) License - there really isn't much novelty here, how about making
>  this public domain or LGPLv2+?

Not public domain — it does not protect the user from patent claims.

Not MIT license — I don't intend to make gifts to proprietary software
vendors. It's bad enough that some companies ignore the requirements
of the GPL. 

I've put the core module under LGPLv3+.

If you want it under LGPLv2+, it would be OK for my part, but we would
have to relax the 'memrchr' module to LGPLv2+ first.

>  2) Applicability to use in a library - using x*alloc and abort is
>  frowned upon in libraries.  Libraries should return error codes on
>  expected errors (and I argue memory allocation failure is an expected
>  error), and not cause application exits.

Done by separating library-safe memory allocations and checked memory
allocations into separate modules.

> One way to resolve 2) is to have two variants of this functionality: one
> low-level variant that doesn't abort the application on errors, and one
> high-level variant that behaves like your implementation.  The
> high-level variant could depend on the low-level variant, but that's not
> essential.

Yes, that's how I did it, for the most part. I couldn't do this so easily
for the string_desc_concat function, though, due to varargs.

Bruno






Re: RFC: add a string-desc module

2023-03-28 Thread Bruno Haible
Paul Eggert wrote:
> > I'll add a comment regarding printf with the "%.*s" directive.
> 
> That works only if the string lacks NULs

Ouch, indeed.

> and its length fits into int, 
> and one must also convert the idx_t length to int (e.g., via a cast 
> which I find tricky).

I've now documented that "%.*s" is NOT the solution.

> Although these limitations could be documented, it 
> might also be good to have an API like quotearg to generate a quoted or 
> quotable string that can be printed with plain %s.

Good point. I've added wrappers around the quotearg functions. Fortunately,
most of the quotearg functions already have a *_mem variant that was designed
precisely for this case.

Bruno






Re: RFC: add a string-desc module

2023-03-27 Thread Simon Josefsson via Gnulib discussion list
Bruno Haible writes:

>   struct
>   {
>     size_t nbytes;
>     char * data;
>   }
>
> I propose to add a module that adds such a type, together with elementary
> functions that work on them.

I think this is a useful contribution, however I see two deal-breakers
for having it in gnulib -- both related to use in libraries.  I think
string helper types/functions like this are useful not only in
applications but also in libraries.  Thus:

 1) License - there really isn't much novelty here, how about making
 this public domain or LGPLv2+?

 2) Applicability to use in a library - using x*alloc and abort is
 frowned upon in libraries.  Libraries should return error codes on
 expected errors (and I argue memory allocation failure is an expected
 error), and not cause application exits.

What do you think?

One way to resolve 2) is to have two variants of this functionality: one
low-level variant that doesn't abort the application on errors, and one
high-level variant that behaves like your implementation.  The
high-level variant could depend on the low-level variant, but that's not
essential.

/Simon




Re: RFC: add a string-desc module

2023-03-25 Thread Paul Eggert

On 2023-03-25 04:49, Bruno Haible wrote:

> I'll add a comment regarding printf with the "%.*s" directive.


That works only if the string lacks NULs and its length fits into int, 
and one must also convert the idx_t length to int (e.g., via a cast 
which I find tricky). Although these limitations could be documented, it 
might also be good to have an API like quotearg to generate a quoted or 
quotable string that can be printed with plain %s.
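
To make the limitation concrete, here is a minimal sketch. The type `sdesc` and both function names are hypothetical stand-ins for illustration, not part of the proposed module. The first function shows the "%.*s" approach, including the cast to int; the second writes the bytes with fwrite, which does not care about embedded NULs.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the proposed descriptor type.  */
typedef struct { size_t nbytes; const char *data; } sdesc;

/* Format S with "%.*s" into BUF and return snprintf's result.
   The output stops at the first NUL byte in S, and the precision
   argument must be an int, hence the range check and the cast.  */
int
format_with_precision (char *buf, size_t bufsize, sdesc s)
{
  assert (s.nbytes <= (size_t) INT_MAX);
  return snprintf (buf, bufsize, "%.*s", (int) s.nbytes, s.data);
}

/* Write all bytes of S to FP, embedded NULs included.
   Return the number of bytes written.  */
size_t
write_all_bytes (FILE *fp, sdesc s)
{
  return fwrite (s.data, 1, s.nbytes, fp);
}
```

For a 7-byte string with a NUL in the middle, the first function produces only the bytes before the NUL, while the second emits all 7 bytes.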




Re: RFC: add a string-desc module

2023-03-25 Thread Vivien Kraus
Hello!

I frequently use ad-hoc code for this, however in library code, in
which xmalloc is not much used.

I learn new gnulib things primarily from the manual. Do you plan to
document it there?

On Friday, 24 March 2023 at 22:50 +0100, Bruno Haible wrote:
> /* Return a copy of string S, as a NUL-terminated C string.  */
> extern char * string_desc_c (string_desc_t s);

Would it be appropriate to use the attribute module and mark this
ATTRIBUTE_DEALLOC_FREE?

Best regards,

Vivien



Re: RFC: add a string-desc module

2023-03-25 Thread Bruno Haible
Vivien Kraus wrote:
> I frequently use ad-hoc code for this, however in library code, in
> which xmalloc is not much used.

Good point. I'll need to duplicate the interface of the memory
allocating functions: one with 'x', that use xmalloc, and one without
'x', for use in libraries.
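
A minimal sketch of what such a pair could look like, assuming a copy operation as the example. The type `sdesc` and the names `sdesc_copy` and `xsdesc_copy` are hypothetical, chosen only for illustration; the real module may name and shape these differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed descriptor type.  */
typedef struct { size_t nbytes; char *data; } sdesc;

/* Library-safe variant: store a freshly allocated copy of S in *RESULTP.
   Return 0 on success, -1 if memory allocation failed.  */
int
sdesc_copy (sdesc *resultp, sdesc s)
{
  char *p = NULL;

  assert (resultp != NULL);
  if (s.nbytes > 0)
    {
      p = malloc (s.nbytes);
      if (p == NULL)
        return -1;
      memcpy (p, s.data, s.nbytes);
    }
  resultp->nbytes = s.nbytes;
  resultp->data = p;
  return 0;
}

/* Checked variant for applications: built on the library-safe one,
   it aborts instead of reporting an allocation failure.  */
sdesc
xsdesc_copy (sdesc s)
{
  sdesc result;

  if (sdesc_copy (&result, s) < 0)
    {
      fputs ("memory exhausted\n", stderr);
      abort ();
    }
  return result;
}
```

A library would call only the first function and propagate the error code; an application that cannot do anything useful on memory exhaustion can use the second.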

> I learn new gnulib things primarily from the manual. Do you plan to
> document it there?

Yes, sure. The reference documentation can stay in the .h file, but
an overview and general usage section belongs in the documentation.

> > /* Return a copy of string S, as a NUL-terminated C string.  */
> > extern char * string_desc_c (string_desc_t s);
> 
> Would it be appropriate to use the attribute module and mark this
> ATTRIBUTE_DEALLOC_FREE?

Good point, yes. Will do!
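
For readers unfamiliar with the attribute, here is a rough sketch of the idea. The macro `MY_DEALLOC_FREE` is a local stand-in written for this example (an assumption about the effect of ATTRIBUTE_DEALLOC_FREE, not its actual definition), and `sdesc` and `sdesc_to_c` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in: on GCC 11 or newer, declare that the returned pointer
   must be released with free().  Elsewhere it expands to nothing.  */
#if defined __GNUC__ && __GNUC__ >= 11 && !defined __clang__
# define MY_DEALLOC_FREE __attribute__ ((__malloc__ (free, 1)))
#else
# define MY_DEALLOC_FREE
#endif

/* Hypothetical stand-in for the proposed descriptor type.  */
typedef struct { size_t nbytes; const char *data; } sdesc;

/* Return a copy of S as a NUL-terminated C string, or NULL if memory
   allocation failed.  The caller releases the result with free().  */
char *sdesc_to_c (sdesc s) MY_DEALLOC_FREE;

char *
sdesc_to_c (sdesc s)
{
  char *result = malloc (s.nbytes + 1);

  if (result == NULL)
    return NULL;
  if (s.nbytes > 0)
    memcpy (result, s.data, s.nbytes);
  result[s.nbytes] = '\0';
  return result;
}
```

With the attribute in place, a compiler that supports it can warn when the result is passed to a mismatched deallocator.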

Thanks for your review and remarks.

Bruno






Re: RFC: add a string-desc module

2023-03-25 Thread Vivien Kraus
On Friday, 24 March 2023 at 19:20 -0400, Jeffrey Walton wrote:
> > The type that I'm proposing does not have NUL byte appended to the data
> > always and automatically, because I think it is more important to have a
> > string_desc_substring function that does not cause memory allocation,
> > than to have string_desc_c function (conversion to 'char *') that does
> > not cause memory allocation.
> 
> I would take caution if not including a NULL. A natural thing to want
> to do is print a string, and C-based routines usually expect a
> terminating NULL.
> 
> Also, if you initialize the struct, then the allocated string will
> likely include a terminating NULL. I understand the size member will
> omit the NULL, but it will be present anyways in the string. (Unless
> you do something ugly, like spell out the characters of the string).

From what I understand, the proposed substring function cannot add a
NUL byte without doing a copy first.

Vivien



Re: RFC: add a string-desc module

2023-03-25 Thread Bruno Haible
Paul Eggert wrote:
> >struct
> >{
> >  size_t nbytes;
> >  char * data;
> >}
> 
> One minor comment: use idx_t instead of size_t, for the usual reasons.

Right, done. Thanks for the reminder.

> Also it might be a bit more efficient to put the pointer first.

On some CPUs probably, but not on others. Unless it's a clear win, I prefer
to avoid such code changes. The entire struct fits into a cache line anyway.

Even an attribute _Alignas(2*sizeof(long)) would only help on NetBSD, IIRC,
because for heap-allocated data, 2*sizeof(long) is already the default
alignment on most platforms.

Bruno






Re: RFC: add a string-desc module

2023-03-25 Thread Bruno Haible
Jeffrey Walton wrote:
> A natural thing to want
> to do is print a string, and C-based routines usually expect a
> terminating NULL.

I'll add a comment regarding printf with the "%.*s" directive.

> Also, if you initialize the struct, then the allocated string will
> likely include a terminating NULL. I understand the size member will
> omit the NULL, but it will be present anyways in the string.

No; it depends where the 'char *' comes from. If it is a pointer into
a piece of memory read through read_file, for example, there will be
no NUL terminator.

Also, in C you can write
  char buf[4] = "abcd";
which does not add a NUL.
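
A small sketch of that case, valid in C but not in C++. The type `sdesc` and the function name are hypothetical and serve only to show that the length has to come from sizeof, not from strlen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the proposed descriptor type.  */
typedef struct { size_t nbytes; const char *data; } sdesc;

/* Exactly four bytes: since the array has room for the visible
   characters only, C drops the terminating NUL of the literal.  */
const char four_bytes[4] = "abcd";

/* Describe the array by pointer and explicit length.  strlen() must not
   be used on it, because there is no terminator to find.  */
sdesc
describe_four_bytes (void)
{
  sdesc s = { sizeof four_bytes, four_bytes };

  assert (s.nbytes == 4);
  return s;
}
```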

> A length prefixed string may be a good idea.

https://github.com/antirez/sds does it like this. But again, this
does not allow for an allocation-free substring function.

> So if you are going to add the "string descriptor", then I hope you
> add some functions to make it easier for less experienced folks to
> write safer code.

I believe all these functions are already in the proposal.

> Also see libbsd's stringlist.h for some inspiration,
> https://cgit.freedesktop.org/libbsd/tree/include/bsd/stringlist.h .

This is unrelated, AFAICS. It's not about a string, but about an
extensible array of strings.

Bruno






Re: RFC: add a string-desc module

2023-03-24 Thread Jeffrey Walton
On Fri, Mar 24, 2023 at 5:50 PM Bruno Haible wrote:
>
> In most application areas, it is not a problem if strings cannot contain NUL
> bytes, and thus the C type 'char *' with its NUL terminator is well usable.
>
> In areas where strings with embedded NUL bytes need to be handled, the common
> approach is to use a 'char * data' pointer together with a 'size_t nbytes'
> size. This works fine in code that constructs or manipulates strings with
> embedded NUL bytes. But when it comes to *storing* them, for example in an
> array or as key or value of a hash table, one needs a type that combines these
> two fields:
>
>   struct
>   {
>     size_t nbytes;
>     char * data;
>   }
>
> I propose to add a module that adds such a type, together with elementary
> functions that work on them.
>
> Such a type was long known as a "string descriptor" in VMS. It's also known
> as basic_string_view in C++, or as String in Java.
>
> The type that I'm proposing does not have NUL byte appended to the data
> always and automatically, because I think it is more important to have a
> string_desc_substring function that does not cause memory allocation,
> than to have string_desc_c function (conversion to 'char *') that does
> not cause memory allocation.

I would take caution if not including a NULL. A natural thing to want
to do is print a string, and C-based routines usually expect a
terminating NULL.

Also, if you initialize the struct, then the allocated string will
likely include a terminating NULL. I understand the size member will
omit the NULL, but it will be present anyways in the string. (Unless
you do something ugly, like spell out the characters of the string).

> The type that I'm proposing does not have two distinct fields
> nbytes_used and nbytes_allocated. Such a type, e.g. [1] attempts to
> cover the use-case of accumulating a string as well. But
>   - The Java experience with String vs. StringBuffer/StringBuilder
>     shows that it is cleaner to separate the two use cases.
>   - For the use-case of accumulating a string, C programmers have been using
>     ad-hoc code with n_used and n_allocated for a long time; there is
>     no need for anything else (except for lazy people who want C to be
>     a scripting language).
>
> The type that I'm proposing also does not have fields for heap management,
> such as a 'bool heap' [2] or a reference count. That's because I think that
>   - managing the allocated memory of a data structure is a different
>     problem than that of representing a string, and it can be achieved
>     with data outside the string descriptor,
>   - Such a field would make it wrong to simply assign a string descriptor
>     to a variable.
>
> Please let me know what you think: Does this have a place in Gnulib? (Or
> should it stay in GNU gettext, where I need it for the Perl parser?)

A length prefixed string may be a good idea. It could also help with
safer string handling functions and efficient operations on a string
because length is already available.

So if you are going to add the "string descriptor", then I hope you
add some functions to make it easier for less experienced folks to
write safer code.

> [1] https://github.com/websnarf/bstrlib/blob/master/bstrlib.txt
> [2] https://github.com/maxim2266/str

Also see libbsd's stringlist.h for some inspiration,
https://cgit.freedesktop.org/libbsd/tree/include/bsd/stringlist.h .

Jeff



Re: RFC: add a string-desc module

2023-03-24 Thread Paul Eggert

On 2023-03-24 14:50, Bruno Haible wrote:

>   struct
>   {
>     size_t nbytes;
>     char * data;
>   }


One minor comment: use idx_t instead of size_t, for the usual reasons.

Also it might be a bit more efficient to put the pointer first.



RFC: add a string-desc module

2023-03-24 Thread Bruno Haible
In most application areas, it is not a problem if strings cannot contain NUL
bytes, and thus the C type 'char *' with its NUL terminator is well usable.

In areas where strings with embedded NUL bytes need to be handled, the common
approach is to use a 'char * data' pointer together with a 'size_t nbytes'
size. This works fine in code that constructs or manipulates strings with
embedded NUL bytes. But when it comes to *storing* them, for example in an
array or as key or value of a hash table, one needs a type that combines these
two fields:

  struct
  {
    size_t nbytes;
    char * data;
  }

I propose to add a module that adds such a type, together with elementary
functions that work on them.
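
As an illustration of the storage use case, the following sketch keeps descriptors in a plain array and sorts them with qsort. The type `sdesc` and all function names are hypothetical. The comparison treats bytes as unsigned char, which is what memcmp does, and sorts a string before any longer string it is a prefix of.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed descriptor type.  */
typedef struct { size_t nbytes; const char *data; } sdesc;

/* Return < 0, 0, or > 0 if A < B, A == B, A > B in lexicographic
   byte order.  Embedded NUL bytes are compared like any other byte.  */
int
sdesc_cmp (sdesc a, sdesc b)
{
  size_t n = a.nbytes < b.nbytes ? a.nbytes : b.nbytes;
  int c = n > 0 ? memcmp (a.data, b.data, n) : 0;

  if (c != 0)
    return c;
  return (a.nbytes > b.nbytes) - (a.nbytes < b.nbytes);
}

/* Adapter with the signature that qsort expects.  */
int
sdesc_cmp_for_qsort (const void *p, const void *q)
{
  return sdesc_cmp (*(const sdesc *) p, *(const sdesc *) q);
}

/* Sort an array of N descriptors in place.  Only the descriptors move;
   the bytes they point to stay where they are.  */
void
sort_descs (sdesc *array, size_t n)
{
  assert (array != NULL || n == 0);
  if (n > 1)
    qsort (array, n, sizeof *array, sdesc_cmp_for_qsort);
}
```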

Such a type was long known as a "string descriptor" in VMS. It's also known
as basic_string_view in C++, or as String in Java.

The type that I'm proposing does not have NUL byte appended to the data
always and automatically, because I think it is more important to have a
string_desc_substring function that does not cause memory allocation,
than to have string_desc_c function (conversion to 'char *') that does
not cause memory allocation.
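
The trade-off can be sketched as follows; `sdesc` and `sdesc_substring` are hypothetical names used for illustration only. Because there is no terminator to maintain, a substring is just a new pointer and length into the same storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed descriptor type.  */
typedef struct { size_t nbytes; const char *data; } sdesc;

/* Return the bytes of S in the half-open range [START, END).
   No memory is allocated: the result points into the storage of S and
   is valid only as long as that storage is.  */
sdesc
sdesc_substring (sdesc s, size_t start, size_t end)
{
  assert (start <= end && end <= s.nbytes);
  sdesc sub = { end - start, s.data + start };
  return sub;
}
```

The byte following a substring is in general not NUL, which is why turning a descriptor into a 'char *' needs a copy.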

The type that I'm proposing does not have two distinct fields
nbytes_used and nbytes_allocated. Such a type, e.g. [1] attempts to
cover the use-case of accumulating a string as well. But
  - The Java experience with String vs. StringBuffer/StringBuilder
    shows that it is cleaner to separate the two use cases.
  - For the use-case of accumulating a string, C programmers have been using
    ad-hoc code with n_used and n_allocated for a long time; there is
    no need for anything else (except for lazy people who want C to be
    a scripting language).
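
For reference, the kind of ad-hoc accumulation code meant here might look like the sketch below. The struct and function names are hypothetical, and the growth policy (doubling, starting at 16 bytes) is just one common choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Ad-hoc accumulator, kept separate from the descriptor type.  */
struct accum
{
  char *data;
  size_t n_used;
  size_t n_allocated;
};

/* Append N bytes from BYTES to ACC.  Return 0 on success, or -1 on
   allocation failure or size overflow, leaving ACC unchanged.  */
int
accum_append (struct accum *acc, const char *bytes, size_t n)
{
  assert (acc != NULL && acc->n_used <= acc->n_allocated);
  if (n > acc->n_allocated - acc->n_used)
    {
      size_t needed = acc->n_used + n;
      size_t new_size = acc->n_allocated > 0 ? 2 * acc->n_allocated : 16;
      char *p;

      if (needed < n)
        return -1;              /* n_used + n wrapped around */
      if (new_size < needed)
        new_size = needed;
      p = realloc (acc->data, new_size);
      if (p == NULL)
        return -1;
      acc->data = p;
      acc->n_allocated = new_size;
    }
  if (n > 0)
    memcpy (acc->data + acc->n_used, bytes, n);
  acc->n_used += n;
  return 0;
}
```

Once accumulation is finished, the pair (data, n_used) is all that a descriptor would need to refer to the result.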

The type that I'm proposing also does not have fields for heap management,
such as a 'bool heap' [2] or a reference count. That's because I think that
  - managing the allocated memory of a data structure is a different
    problem than that of representing a string, and it can be achieved
    with data outside the string descriptor,
  - Such a field would make it wrong to simply assign a string descriptor
    to a variable.

Please let me know what you think: Does this have a place in Gnulib? (Or
should it stay in GNU gettext, where I need it for the Perl parser?)

Bruno

[1] https://github.com/websnarf/bstrlib/blob/master/bstrlib.txt
[2] https://github.com/maxim2266/str
/* GNU gettext - internationalization aids
   Copyright (C) 2023 Free Software Foundation, Inc.

   This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 3 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */

/* Written by Bruno Haible, 2023.  */

#ifndef _STRING_DESC_H
#define _STRING_DESC_H 1

/* Get size_t, ptrdiff_t.  */
#include <stddef.h>

/* Get bool.  */
#include <stdbool.h>


#ifdef __cplusplus
extern "C" {
#endif


/* Type describing a string that may contain NUL bytes.
   It's merely a descriptor of an array of bytes.  */
typedef struct string_desc_t string_desc_t;
struct string_desc_t
{
  size_t nbytes;
  char *data;
};

/* String descriptors can be passed and returned by value.  */


/*  Side-effect-free operations on string descriptors  */

/* Return the length of the string S.  */
extern size_t string_desc_length (string_desc_t s);

/* Return the byte at index I of string S.
   I must be < length(S).  */
extern char string_desc_char_at (string_desc_t s, size_t i);

/* Return a read-only view of the bytes of S.  */
extern const char * string_desc_data (string_desc_t s);

/* Return true if S is the empty string.  */
extern bool string_desc_is_empty (string_desc_t s);

/* Return true if S starts with PREFIX.  */
extern bool string_desc_startswith (string_desc_t s, string_desc_t prefix);

/* Return true if S ends with SUFFIX.  */
extern bool string_desc_endswith (string_desc_t s, string_desc_t suffix);

/* Return > 0, == 0, or < 0 if A > B, A == B, A < B.
   This uses a lexicographic ordering, where the bytes are compared as
   'unsigned char'.  */
extern int string_desc_cmp (string_desc_t a, string_desc_t b);

/* Return the index of the first occurrence of C in S,
   or -1 if there is none.  */
extern ptrdiff_t string_desc_index (string_desc_t s, char c);

/* Return the index of the last occurrence of C in S,
   or -1 if there is none.  */
extern ptrdiff_t string_desc_last_index (string_desc_t s, char c);

/* Return the index of the first occurrence of NEEDLE in HAYSTACK,
   or -1 if there is none.  */
extern ptrdiff_t