Re: G_UTF8String: Boxed Type Proposal

2016-03-23 Thread Behdad Esfahbod
On Mon, Mar 21, 2016 at 3:30 PM, Randall Sawyer wrote:

> Frankly, the use of the term "character" when referring to a "UTF-8
> encoded Unicode code point" was for me a source of confusion


A character means a "Unicode character".  That's independent of encoding,
so, no, it does NOT mean "UTF-8 encoded Unicode code point".


> when I leapt to the conclusion of the unmet need of a UTF-8-length-aware
> wrapped string type - be it called "G_UTF8String" or "GUString".
>
> I recommend that all Glib documentation be rewritten such that throughout
> all descriptions of g_utf8_*() functions, the parlance "character" be
> replaced with "UTF-8 code point sequence" or equivalent terminology.
>
> Thank you.



-- 
behdad
http://behdad.org/
___
gtk-devel-list mailing list
gtk-devel-list@gnome.org
https://mail.gnome.org/mailman/listinfo/gtk-devel-list


Re: G_UTF8String: Boxed Type Proposal

2016-03-21 Thread Randall Sawyer
Frankly, the use of the term "character" when referring to a "UTF-8 
encoded Unicode code point" was for me a source of confusion when I 
leapt to the conclusion of the unmet need of a UTF-8-length-aware 
wrapped string type - be it called "G_UTF8String" or "GUString".


I recommend that all Glib documentation be rewritten such that 
throughout all descriptions of g_utf8_*() functions, the parlance 
"character" be replaced with "UTF-8 code point sequence" or equivalent 
terminology.


Thank you.




Re: G_UTF8String: Boxed Type Proposal

2016-03-21 Thread Randall Sawyer

Thank you once again to all who have responded.

I have changed my mind.

I DO grasp the nature of responders' objections.

My understanding has now reached a "tipping point".

What is the tipping point?

On 03/21/2016 04:30 PM, Behdad Esfahbod wrote:

> I like to voice my opinion as well:
>
>   - Bundling data and its length in a boxed type is useful, but that's
>     gblob,
>
>   - Bundling number-of-Unicode-character is rarely useful indeed,
>
>   - A string API that would require any changes to the string content to go
>     through editing function calls is painful and will remain unused,
>
> I also have a piece of a more personal opinion:  Many processes that simply
> *reject* invalid Unicode text are useless in many situations.  For example,
> gedit used to refuse to open a file if it had even a single
> invalidly-encoded byte.  I find that annoyingly limited.  Same thing about
> Pango.  Fortunately, both have been fixed for many years now.
>
> behdad
>
> On Mon, Mar 21, 2016 at 6:32 AM, Matthias Clasen wrote:
>
> > On Fri, Mar 18, 2016 at 9:57 AM, Randall Sawyer wrote:
> >
> > > 2) If the former is true - which it is - then the developer will need to
> > > call g_utf8_strlen() to determine if there are multi-byte sequences to
> > > navigate - and if there are - g_utf8_offset_to_pointer() to locate the
> > > array index. Doesn't this increase processing demand?
> >
> > It does. But whether that is a problem (in general, or for your
> > particular use case) can only be answered by profiling. My theory is
> > that you won't be able to notice this on the profile at all, unless
> > all your application does is constantly operate on large amounts of
> > text. In which case, you really shouldn't be using GString to begin
> > with...


Matthias, I comprehend what you are saying here.

As Christian pointed out recently 
(https://mail.gnome.org/archives/gtk-devel-list/2016-March/msg00037.html), 
"DRY alone is not a sufficient argument."



> > > 3) Wouldn't it be helpful to keep track of how many code points
> > > ("characters") are stored in the GString - a number which may be less than
> > > the value of GString.len - without needing to call g_utf8_strlen() each
> > > time to find out?
> > > 4) Would my efforts be better spent editing patches of "gstring.h" and
> > > "gstring.c" - or - to proceed as I am to introduce a parallel
> > > alternative?
> >
> > I think we haven't gotten past the 'what is the problem you are trying
> > to solve - and is it a problem in the first place ?' part yet.





The tipping point is the function g_utf8_normalize() - which is called 
by objects which DO possess a length-of-string in units of UTF-8 
code points ("characters" in GLib parlance).


If my proposed idea were to be adopted in a useful way - then every call 
to any g_utf8_*() function would require that it be wrapped in a 
g_ustring_*() [previously g_utf8_string_*()] function in order for 
GUString [previously G_UTF8String] to be truly useful.


Time to move on.

Along the way - however - I have come up with two functions which I will 
be proposing and which may very well be useful in a number of certain cases:


g_utf8_unilen() - which measures the length of string in UTF-8 sequences 
("characters") primarily and in non-nul bytes secondarily
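A minimal sketch of what such a one-pass measurement could look like, written in plain C rather than with GLib types. The name utf8_unilen_sketch and its signature are assumptions made for illustration, not the proposed patch, and the input is assumed to be valid UTF-8 already:

```c
#include <stddef.h>

/* Hypothetical helper, not a GLib function: counts the code points and
 * the bytes of a NUL-terminated UTF-8 string in a single pass.
 * Assumes the input is valid UTF-8. */
size_t
utf8_unilen_sketch (const char *str, size_t *n_bytes)
{
  size_t n_chars = 0;
  size_t i = 0;

  for (; str[i] != '\0'; i++)
    {
      /* Continuation bytes look like 10xxxxxx; every other byte
       * starts a new code point. */
      if (((unsigned char) str[i] & 0xC0) != 0x80)
        n_chars++;
    }

  if (n_bytes != NULL)
    *n_bytes = i;   /* bytes before the terminating NUL */

  return n_chars;
}
```

When the two results are equal, the string holds only single-byte (ASCII) characters.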


g_utf8_offset_to_pointer_sized () - which speeds up the lookup by 
first comparing byte length to UTF-8 length [for the cases when these 
are both known] - opting for pointer arithmetic when equal - and then 
compares UTF-8 offset to UTF-8 length in order to decide whether to 
parse the first 3/4 or the last 1/4 when calling g_utf8_offset_to_pointer()
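
The decision logic described here can be sketched roughly as follows. This is plain C with a made-up name and signature, not the patch itself and not an existing GLib function; it assumes valid UTF-8 and that the caller already knows both lengths:

```c
#include <stddef.h>

/* Hypothetical sketch, not a GLib function. byte_len and char_len are
 * the known lengths of str in bytes and in code points. Returns a
 * pointer to the code point at index offset, or NULL if offset is out
 * of range. Assumes valid UTF-8. */
const char *
utf8_offset_to_pointer_sized_sketch (const char *str,
                                     size_t      byte_len,
                                     size_t      char_len,
                                     size_t      offset)
{
  const char *p;

  if (offset > char_len)
    return NULL;

  /* Every character is a single byte: plain pointer arithmetic. */
  if (byte_len == char_len)
    return str + offset;

  if (offset < char_len - char_len / 4)
    {
      /* Target is in the first 3/4: walk forward from the start. */
      p = str;
      while (offset > 0)
        {
          p++;
          while (((unsigned char) *p & 0xC0) == 0x80)
            p++;
          offset--;
        }
      return p;
    }

  /* Target is in the last 1/4: walk backward from the end. */
  p = str + byte_len;
  for (size_t back = char_len - offset; back > 0; back--)
    {
      p--;
      while (((unsigned char) *p & 0xC0) == 0x80)
        p--;
    }
  return p;
}
```

Whether the backward walk pays off in practice is the kind of question that only profiling can answer.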


These last two, I will definitely be submitting as a patch.





Re: G_UTF8String: Boxed Type Proposal

2016-03-21 Thread Behdad Esfahbod
I like to voice my opinion as well:

  - Bundling data and its length in a boxed type is useful, but that's
gblob,

  - Bundling number-of-Unicode-character is rarely useful indeed,

  - A string API that would require any changes to the string content to go
through editing function calls is painful and will remain unused,

I also have a piece of a more personal opinion:  Many processes that simply
*reject* invalid Unicode text are useless in many situations.  For example,
gedit used to refuse to open a file if it had even a single
invalidly-encoded byte.  I find that annoyingly limited.  Same thing about
Pango.  Fortunately, both have been fixed for many years now.


behdad

On Mon, Mar 21, 2016 at 6:32 AM, Matthias Clasen wrote:

> On Fri, Mar 18, 2016 at 9:57 AM, Randall Sawyer wrote:
>
> > 2) If the former is true - which it is - then the developer will need to
> > call g_utf8_strlen() to determine if there are multi-byte sequences to
> > navigate - and if there are - g_utf8_offset_to_pointer() to locate the
> > array index. Doesn't this increase processing demand?
>
> It does. But whether that is a problem (in general, or for your
> particular use case) can only be answered by profiling. My theory is
> that you won't be able to notice this on the profile at all, unless
> all your application does is constantly operate on large amounts of
> text. In which case, you really shouldn't be using GString to begin
> with...
>
> > 3) Wouldn't it be helpful to keep track of how many code points
> > ("characters") are stored in the GString - a number which may be less than
> > the value of GString.len - without needing to call g_utf8_strlen() each
> > time to find out?
> > 4) Would my efforts be better spent editing patches of "gstring.h" and
> > "gstring.c" - or - to proceed as I am to introduce a parallel
> > alternative?
>
> I think we haven't gotten past the 'what is the problem you are trying
> to solve - and is it a problem in the first place ?' part yet.



-- 
behdad
http://behdad.org/


Re: G_UTF8String: Boxed Type Proposal

2016-03-21 Thread Matthias Clasen
On Fri, Mar 18, 2016 at 9:57 AM, Randall Sawyer wrote:


> 2) If the former is true - which it is - then the developer will need to
> call g_utf8_strlen() to determine if there are multi-byte sequences to
> navigate - and if there are - g_utf8_offset_to_pointer() to locate the array
> index. Doesn't this increase processing demand?

It does. But whether that is a problem (in general, or for your
particular use case) can only be answered by profiling. My theory is
that you won't be able to notice this on the profile at all, unless
all your application does is constantly operate on large amounts of
text. In which case, you really shouldn't be using GString to begin
with...

> 3) Wouldn't it be helpful to keep track of how many code points
> ("characters") are stored in the GString - a number which may be less than
> the value of GString.len - without needing to call g_utf8_strlen() each time
> to find out?
> 4) Would my efforts be better spent editing patches of "gstring.h" and
> "gstring.c" - or - to proceed as I am to introduce a parallel alternative?

I think we haven't gotten past the 'what is the problem you are trying
to solve - and is it a problem in the first place ?' part yet.


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Matthias Clasen
Sure, code point works too. Anyway, enough with the ontology, we're
not a standards body.

I still don't think that we need a utf8-string datatype.


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Christian Hergert
On 03/19/2016 02:04 PM, Randall Sawyer wrote:
>> It's possible you are focusing on implementation before measuring the
>> problem. DRY alone is not a sufficient argument.
> 
> "DRY" is not a term I know - or at least in the way you are using it
> here.

https://en.wikipedia.org/wiki/Don't_repeat_yourself

> 2) Tags for text containing the digits and cardinal directions
> specified as editable. Tags containing other symbols - uneditable.

I suspect with all the new CSS work this could be implemented as a
series of entries without frames and a box drawing the entry frame.

-- Christian


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Randall Sawyer

On 03/19/2016 04:09 PM, Christian Hergert wrote:

> It's possible you are focusing on implementation before measuring the
> problem. DRY alone is not a sufficient argument.

"DRY" is not a term I know - or at least in the way you are using it here.

> One topic I'm interested in covering at the hackfest in June (if there
> is sufficient interest) is a general purpose text buffer for Gio.
>
> ...
>
> Some notable things I want to make a modern textview:
>
> ...
>
>   2) Interface implementations that use append only change buffers.
>
> ...
>
> One thing you'll notice when you combine the above is that you've
> basically written a database.
>
> -- Christian


Thanks for sharing that, Christian. Sounds interesting to look into!

I do understand what you have been getting at with the "append only" 
buffers.


What prompted my proposal in the first place was that I find GtkEntry 
and supporting objects to be rather limiting.
I have been working towards developing a general way to extend the 
features of the single-line entry to include:


1) easily add user-specified formatting and validation

2) affixes which may or may not be selectable

3) rules for how the text is converted and into which MIME when dragging 
or copying


AND for these to be easily user-defined.

For an exotic example: "76°54′32″N-12°34′56″W"

1) valid only for regex 
"^\d{1,2}\302\260\d{1,2}\342\200\262\d{1,2}\342\200\263[NSns]-\d{1,2}\302\260\d{1,2}\342\200\262\d{1,2}\342\200\263[EWew]$"


2) Tags for text containing the digits and cardinal directions specified 
as editable. Tags containing other symbols - uneditable.


3) drags as struct coord_struct {enum_t COORDS_POLAR, double 76.908889, 
double 12.58} to a map widget with "coords" property set to enum_t 
COORDS_MERCATOR which then calls coord_struct_convert (struct 
coord_struct *dst, struct coord_struct *src); etc. ... in order to 
locate the pixel on the map.
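
The decimal values in that struct come from the usual degrees/minutes/seconds conversion. A minimal sketch of it, with dms_to_decimal as a made-up helper name; note that 12°34′56″ works out to about 12.582222, so the 12.58 above is a rounded figure:

```c
/* Hypothetical helper: converts degrees, minutes and seconds of arc
 * to decimal degrees. Sign handling for S and W is left to the caller. */
double
dms_to_decimal (unsigned deg, unsigned min, unsigned sec)
{
  return deg + min / 60.0 + sec / 3600.0;
}
```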


Eventually, I could see something like the introduction of GtkEntryMark 
and GtkEntryTag as aiding (2).


Before embarking further however, I felt that the establishment of the 
"length-aware UTF-8 string boxed type" could greatly reduce the overhead 
of redundant code. Perhaps what is called for instead is an Entry Buffer 
and Widget which use segments and append-only-buffers as GtkTextBuffer 
and GtkTextView do?


I will be on the lookout for more news about your vision!

--Randall




Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Christian Hergert
On 03/19/2016 12:25 PM, Randall Sawyer wrote:
> 
> If there already were such a structure, then it could already have been
> employed by existing objects and structures such as GtkEntryBuffer and
> PangoLayout - to name two - eliminating the need for extra lines of
> redundant code. In fact - as I envision it - the entirety of struct
> _GtkEntryBuffer IS a GUString - and all of the procedures which operate
> upon instances of that structure could be performed by g_ustring_*()
> functions!

It's possible you are focusing on implementation before measuring the
problem. DRY alone is not a sufficient argument.

> The emergence of such a structure may, IMHO, facilitate more rapid
> development of future structures and objects which could also benefit
> from having such a length-aware string object. That's all.

One topic I'm interested in covering at the hackfest in June (if there
is sufficient interest) is a general purpose text buffer for Gio.

While not a particular goal, it would allow us to pass content between
gtk and pango without requiring contiguous strings. But that is an awful
lot of pango work and Behdad is probably in the best situation to say if
it would even help. I suspect on *really* long lines it might, but
otherwise not.

Some notable things I want to make a modern textview:

 1) Interface that makes it possible to mmap() source contents
    and/or a page cache for large file support. Being limited to
    virtual address space like we are today is non-ideal.
    (SQL dumps, log viewers, IRC backlog come to mind).
    - another interface implementation could be a contiguous string,
      but I expect the same append only structure is still faster
      once you start editing.
 2) Interface implementations that use append only change buffers.
 3) Unlimited Undo/Redo tracking
 4) Ability to get a snapshot of the buffer for off-thread processing.
    (Highlighting, glyph sizing, etc).
    a. ability to do this without doing copies in most cases
 5) Ability to attach indexes to "GTextBuffer"
    a. Number of code points could be one such line-based index
    b. Line height calculations another
    c. highlighting tags, completion word indexes, etc
    d. index merging to merge off-thread calculations
    e. This gets a little tricky when you want multiple views with
       different font sizes (like we do in Builder and Gedit)
 6) efficient slices for copy/paste

One thing you'll notice when you combine the above is that you've
basically written a database.

-- Christian


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Randall Sawyer

On 03/19/2016 02:57 PM, Emmanuele Bassi wrote:

> Hi;
>
> On 19 March 2016 at 18:03, Randall Sawyer <srandallsaw...@hushmail.me> wrote:
> > The concision of "GUString" over "G_UTF8String" reflects the concision of my
> > thoughts over what they were at the beginning of this thread.
>
> Since you've brought it up multiple times, I wanted to ensure you
> understood this particular point...
>
> ...
>
> In general, especially for C developers, you're supposed to store
> strings as NUL-terminated char*; for binary blobs, you should be using a
> uint8_t* with a length, instead. Those are the existing best practices
> in the language, and are also used throughout the G* platform.
>
> Ciao,
>   Emmanuele.



I do understand that. Thank you.

What I am proposing is a means of combining a true string with its 
byte-length AND its utf8-length - thus eliminating the need for 
redundant calculations.


If there already were such a structure, then it could already have been 
employed by existing objects and structures such as GtkEntryBuffer and 
PangoLayout - to name two - eliminating the need for extra lines of 
redundant code. In fact - as I envision it - the entirety of struct 
_GtkEntryBuffer IS a GUString - and all of the procedures which operate 
upon instances of that structure could be performed by g_ustring_*() 
functions!


The emergence of such a structure may, IMHO, facilitate more rapid 
development of future structures and objects which could also benefit 
from having such a length-aware string object. That's all.





Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Florian Müllner
On Fri, Mar 18, 2016 at 2:57 PM Randall Sawyer wrote:

> how about the following modifications?
> Change "gstring.h":
> ...
> struct _GString
> {
>   gchar  *str;
>   gsize len;
>   gsize allocated_len;
>   gsize utf8_len;
> };
> ...

 Changing the size of a public struct is an ABI break, so this is not an
option for glib-2.x.
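
To illustrate why the extra field matters, here is a small sketch with two stand-in layouts. The struct names and the helper are invented for this example and are not the real GLib definitions; the point is only that code compiled against the old header reserves the old size wherever it embeds or stack-allocates the struct:

```c
#include <stddef.h>

/* Stand-ins for the two layouts under discussion; these are not the
 * real GLib definitions. */
struct string_v1
{
  char  *str;
  size_t len;
  size_t allocated_len;
};

struct string_v2
{
  char  *str;
  size_t len;
  size_t allocated_len;
  size_t utf8_len;   /* the proposed extra field */
};

/* Number of bytes by which a caller built against v1 would
 * under-allocate if the library started writing a v2 struct. */
size_t
abi_size_gap (void)
{
  return sizeof (struct string_v2) - sizeof (struct string_v1);
}
```

The existing fields keep their offsets, but any library write to utf8_len would land outside the memory that an old caller set aside.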


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Emmanuele Bassi
Hi;

On 19 March 2016 at 18:03, Randall Sawyer  wrote:
> The concision of "GUString" over "G_UTF8String" reflects the concision of my
> thoughts over what they were at the beginning of this thread.

Since you've brought it up multiple times, I wanted to ensure you
understood this particular point...

GString is *not* a string type. It's a string builder type, heavily
modeled on StringBuilder in Java:

https://docs.oracle.com/javase/tutorial/java/data/buffers.html

GString is only meant to be used as a way to build strings from other
sources, not for storing or measuring strings.

The naming is a bit unfortunate, and has tricked various newcomers to
the G* platform libraries.

In general, especially for C developers, you're supposed to store
strings as NUL-terminated char*; for binary blobs, you should be using a
uint8_t* with a length, instead. Those are the existing best practices
in the language, and are also used throughout the G* platform.
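
As a rough illustration of the builder pattern described above, here is a tiny stand-in written in plain C. The builder type and its two functions are invented for this sketch and are not GString; the idea is just "append pieces, then hand over a plain NUL-terminated char*":

```c
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for a string builder; not GString itself. */
typedef struct
{
  char  *buf;
  size_t len;   /* bytes used, excluding the NUL */
  size_t cap;   /* bytes allocated */
} builder;

/* Appends text, growing the buffer as needed. Returns 0 on success,
 * -1 if memory could not be allocated. */
int
builder_append (builder *b, const char *text)
{
  size_t n = strlen (text);

  if (b->len + n + 1 > b->cap)
    {
      size_t cap = b->cap ? b->cap : 16;
      char *grown;

      while (cap < b->len + n + 1)
        cap *= 2;
      grown = realloc (b->buf, cap);
      if (grown == NULL)
        return -1;
      b->buf = grown;
      b->cap = cap;
    }

  memcpy (b->buf + b->len, text, n + 1);   /* copies the NUL too */
  b->len += n;
  return 0;
}

/* Hands the finished string to the caller, who releases it with
 * free(). The builder is left empty. Returns NULL if nothing was
 * ever appended. */
char *
builder_steal (builder *b)
{
  char *out = b->buf;

  b->buf = NULL;
  b->len = b->cap = 0;
  return out;
}
```

With GLib itself the same flow would be g_string_new(), g_string_append() and finally g_string_free (string, FALSE), which returns the character data for storage as an ordinary char*.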

Ciao,
 Emmanuele.

-- 
https://www.bassi.io
[@] ebassi [@gmail.com]


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Randall Sawyer

On 03/19/2016 01:38 PM, Christian Hergert wrote:

> On 03/19/2016 06:57 AM, Randall Sawyer wrote:
>
> > Some object classes - such as GtkEntryBuffer - store this value and
> > update it as text is inserted or deleted. That is efficient. The fact
> > that developers need to write equivalent code for each such class is
> > inefficient.
>
> A string abstraction like the one you describe is not an efficient way
> to do text processing, especially for interactive widgets.
>
> ...
>
> Before we add new data structures to GLib, we like to have a solid use
> case that the data structure solves. So far, I haven't seen a
> concrete problem for which this data structure would be the ideal fix.


Thank you, Christian.

I am appreciative of all of the feedback I have received in this thread. :-)

I am currently writing a formalized case for this proposal - including 
citing specific current source code which uses the Glib API. Next I will 
edit the code and documentation I have into a more presentable format to 
submit as a patch via git.gnome.org. When I have submitted the patch 
(not sure how long it takes me to get that far - as it is my first), I 
will post a thread on this mail list entitled "GUString: Boxed Type 
Proposal".


I am not motivated to be "right". I am instead motivated to discover the 
best solutions. For this, it is important to leave no stone unturned. 
All who have responded to this idea have certainly been helping to turn 
stones - and to help me to better define the question I have.


The concision of "GUString" over "G_UTF8String" reflects the concision 
of my thoughts over what they were at the beginning of this thread.


Thank you, All.




Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Matthias Clasen
On Thu, Mar 17, 2016 at 4:09 PM, Jasper St. Pierre wrote:
> The major issue is that "Unicode character" doesn't have a good
> definition. The most likely definition is a "Unicode code point",
> however, Windows uses "Unicode character" to mean a UTF-16 byte
> sequence, which means that any code point above the Basic Multilingual
> Plane is really composed of two "Unicode characters", which are, of
> course, surrogate pairs.

Terminology can certainly be confusing at times, but I think that a
Unicode character is a perfectly well-defined entity, notwithstanding
the fact that it can be represented in various encodings (a utf8
sequence, a ucs4 word, a utf-16 surrogate pair, etc).
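
To make the surrogate-pair case concrete, here is a small sketch of UTF-16 encoding for a single code point. The function name utf16_encode is made up for this example:

```c
#include <stdint.h>

/* Encodes one code point as UTF-16. Returns the number of 16-bit units
 * written to out (1 or 2), or 0 if cp is not a valid scalar value. */
int
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  if (cp < 0x10000)
    {
      /* Inside the Basic Multilingual Plane: one unit. */
      out[0] = (uint16_t) cp;
      return 1;
    }

  /* Above the BMP: split the 20 remaining bits into a high and a low
   * surrogate. */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
  return 2;
}
```

One code point above the BMP therefore occupies two 16-bit units, which is where the differing uses of "character" come from.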


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Christian Hergert
On 03/19/2016 06:57 AM, Randall Sawyer wrote:
> 
> Some object classes - such as GtkEntryBuffer - store this value and
> update it as text is inserted or deleted. That is efficient. The fact
> that developers need to write equivalent code for each such class is
> inefficient.

A string abstraction like the one you describe is not an efficient way
to do text processing, especially for interactive widgets.

It's much better to split things into two data-structures.

1) An append only buffer with all text content.
2) A pointer table with start:end tuples representing ranges in the
append only buffer.

And if you are doing a full text editing widget like GtkTextView:

3) Other necessary indexes are similar to #2, with interval trees. (Line
height, row calculations, format tags, etc)

This simplifies unlimited undo, mmap()'ing large input data, avoiding
large memmove()s and simplifying incremental utf-8 validations.
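
A very small sketch of points 1 and 2, using fixed-size arrays to keep it short. All names here (piece, piece_doc and the functions) are invented for illustration, offsets are in bytes, and a real implementation would use growable storage and a tree rather than a flat table:

```c
#include <stddef.h>
#include <string.h>

#define MAX_TEXT   256
#define MAX_PIECES 16

typedef struct { size_t start, end; } piece;   /* [start, end) in text */

typedef struct
{
  char   text[MAX_TEXT];       /* append only: bytes are never moved */
  size_t text_len;
  piece  pieces[MAX_PIECES];   /* ranges in document order */
  size_t n_pieces;
} piece_doc;

static size_t
piece_len (const piece *p)
{
  return p->end - p->start;
}

/* Inserts s at byte offset pos of the document. Returns 0 on success,
 * -1 if pos is out of range or the fixed arrays are full. */
int
piece_doc_insert (piece_doc *d, size_t pos, const char *s)
{
  size_t n = strlen (s);
  size_t i = 0, off = 0, local;
  piece added;

  if (n == 0)
    return 0;
  if (d->text_len + n > MAX_TEXT || d->n_pieces + 2 > MAX_PIECES)
    return -1;

  /* Find the piece that contains document offset pos. */
  while (i < d->n_pieces && off + piece_len (&d->pieces[i]) <= pos)
    off += piece_len (&d->pieces[i++]);
  local = pos - off;
  if (i == d->n_pieces && local != 0)
    return -1;   /* pos is past the end of the document */

  /* New bytes are only ever appended to the buffer. */
  memcpy (d->text + d->text_len, s, n);
  added.start = d->text_len;
  added.end = d->text_len + n;
  d->text_len += n;

  if (local != 0)
    {
      /* Split piece i in two around the insertion point. */
      piece tail = { d->pieces[i].start + local, d->pieces[i].end };

      d->pieces[i].end = tail.start;
      i++;
      memmove (&d->pieces[i + 1], &d->pieces[i],
               (d->n_pieces - i) * sizeof (piece));
      d->pieces[i] = tail;
      d->n_pieces++;
    }

  /* Only the small table is shifted, never the text itself. */
  memmove (&d->pieces[i + 1], &d->pieces[i],
           (d->n_pieces - i) * sizeof (piece));
  d->pieces[i] = added;
  d->n_pieces++;
  return 0;
}

/* Copies the document into out, which must hold text_len + 1 bytes. */
void
piece_doc_read (const piece_doc *d, char *out)
{
  for (size_t i = 0; i < d->n_pieces; i++)
    {
      size_t n = piece_len (&d->pieces[i]);

      memcpy (out, d->text + d->pieces[i].start, n);
      out += n;
    }
  *out = '\0';
}
```

Because earlier bytes are never overwritten, an undo step only has to restore an older copy of the table.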

Before we add new data structures to GLib, we like to have a solid use
case that the data structure solves. So far, I haven't seen a
concrete problem for which this data structure would be the ideal fix.

-- Christian


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Nicolas George
[ Replying a little randomly to this message. ]

Randall Sawyer:
> 3) Wouldn't it be helpful to keep track of how many code points
> ("characters") are stored in the GString - a number which may be less than
> the value of GString.len - without needing to call g_utf8_strlen() each time
> to find out?

IMnsHO, NO, definitely not.

To the people who want this feature: why do you want it? The octet length is
necessary to copy the string, store it in a file, send it to network.

But what use is the number of Unicode code points? Or their index in the
string? 

In my experience, almost the only relevant treatment of a Unicode string is
to walk over it, character by character, applying parsing or typographic
algorithms. Knowing how many code points, or even graphemes, were in a given
span or the whole string is almost always irrelevant.
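
Such a walk can be sketched in a few lines of plain C. The helper name utf8_next is made up for this example and the input is assumed to be valid UTF-8; in GLib the same job is done with g_utf8_get_char() and g_utf8_next_char():

```c
#include <stdint.h>

/* Decodes the code point that starts at *p and advances *p past it.
 * Assumes valid UTF-8; does no error checking. */
uint32_t
utf8_next (const char **p)
{
  const unsigned char *s = (const unsigned char *) *p;
  uint32_t cp;
  int extra;   /* number of continuation bytes that follow */

  if (s[0] < 0x80)
    { cp = s[0];        extra = 0; }
  else if (s[0] < 0xE0)
    { cp = s[0] & 0x1F; extra = 1; }
  else if (s[0] < 0xF0)
    { cp = s[0] & 0x0F; extra = 2; }
  else
    { cp = s[0] & 0x07; extra = 3; }

  /* Each continuation byte contributes its low six bits. */
  for (int i = 1; i <= extra; i++)
    cp = (cp << 6) | (s[i] & 0x3F);

  *p += extra + 1;
  return cp;
}
```

A loop of the form `while (*p != '\0') handle (utf8_next (&p));` visits every code point without ever needing their total count.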

Regards,

-- 
  Nicolas George




Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Simon McVittie
On 17/03/16 20:29, Matthias Clasen wrote:
> Terminology can certainly be confusing at times, but I think that a
> Unicode character is a perfectly well-defined entity, notwithstanding
> the fact that it can be represented in various encodings (a utf8
> sequence, a ucs4 word, a utf-16 surrogate pair, etc).

You mean a code point, then (that's basically what gunichar is). I think
the reason Unicode people are so pedantic about "code point" is because
a code point may or may not be what you actually mean when you say
"character", whereas it's rare that I see "code point" used with a
meaning other than its Unicode one.

More precisely, a Unicode code point is an abstract entity indexed by a
number, such as U+0041 LATIN CAPITAL LETTER A or U+262D HAMMER AND
SICKLE, which can only be concretely represented as some particular byte
sequence by passing it through an encoding like UCS-4, UTF-8 or
ISO-8859-1. Some encodings are more obvious than others, and in
particular non-Unicode encodings like ISO-8859-1 cannot represent every
Unicode code point.
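
As a small sketch of that distinction, the two functions below pass the same abstract code point through two encodings. The names utf8_encode and latin1_encode are invented for this example:

```c
#include <stdint.h>

/* Encodes one code point as UTF-8. Returns the number of bytes written
 * to out (1 to 4), or 0 if cp is not a valid scalar value. */
int
utf8_encode (uint32_t cp, unsigned char out[4])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}

/* ISO-8859-1 can only represent U+0000 through U+00FF. Returns 1 and
 * writes one byte on success, 0 if cp has no representation. */
int
latin1_encode (uint32_t cp, unsigned char *out)
{
  if (cp > 0xFF)
    return 0;
  *out = (unsigned char) cp;
  return 1;
}
```

U+0041 comes out as a single byte in both encodings, while U+262D takes three bytes in UTF-8 and cannot be written in ISO-8859-1 at all.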

-- 
Simon McVittie
Collabora Ltd. 



Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Randall Sawyer

On 03/19/2016 03:41 AM, Errol van de l'Isle wrote:

> Just to add my two cents worth as a user of glibmm.
>
> Glib::ustring uses g_utf8_pointer_to_offset() to obtain the length of
> the string in characters in the method Glib::ustring::length. The
> method Glib::ustring::bytes returns the length in bytes.
>
> At no point does it store the number of UTF-8 characters as this would
> be inefficient.
>
> For simple string manipulation like inserting a string or character or
> concatenating would require extra work to be done. The string needs to
> be checked that it is still valid UTF-8 before the length is updated.
> The next issue is what to do when the string becomes invalid UTF-8.
> Doing this for every string operation will have a performance
> implication. Imagine doing this in a loop inserting a byte from a
> stream!
>
> Checking at the end of all the operations or handing it over to GTK to
> deal with the problems will be more efficient and less of a headache.


Thank you, Errol.

I understand that it would be inefficient to validate the string each time.
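
For a sense of what each such validation involves, here is a rough sketch of a well-formedness check in plain C. The name utf8_is_valid is made up for this example; GLib's own entry point for this is g_utf8_validate():

```c
#include <stddef.h>

/* Returns 1 if the first len bytes of s are well-formed UTF-8,
 * otherwise 0. Every byte is examined, so the cost grows with len. */
int
utf8_is_valid (const unsigned char *s, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = s[i];
      size_t extra;
      unsigned long cp, min;

      if (c < 0x80)
        {
          i++;
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { extra = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { extra = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0)
        { extra = 3; cp = c & 0x07; min = 0x10000; }
      else
        return 0;   /* stray continuation byte or invalid lead byte */

      if (len - i <= extra)
        return 0;   /* sequence is cut off */

      for (size_t k = 1; k <= extra; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return 0;
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      /* Reject overlong forms, surrogates and values past U+10FFFF. */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

      i += extra + 1;
    }

  return 1;
}
```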

I picked up on this fact from Matthias Clasen's first response in this 
thread 
(https://mail.gnome.org/archives/gtk-devel-list/2016-March/msg00014.html):


"Every string we pass around in GLib and GTK+, and every char * in their 
APIs is expected to be in utf8."


My response to this 
(https://mail.gnome.org/archives/gtk-devel-list/2016-March/msg00015.html):


"Here is the vision: Once raw string data - or gunichar value - has been 
passed and validated into the construction of a "G_UTF8String" 
structure, then contents of two-or-more of these can be easily combined 
without the need for additional measuring or validating."


It is inefficient for functions which need to know the code-point length 
of a utf8 string to have to calculate that value each time it is needed. 
This is the current state of affairs.


Some object classes - such as GtkEntryBuffer - store this value and 
update it as text is inserted or deleted. That is efficient. The fact 
that developers need to write equivalent code for each such class is 
inefficient.


It is these two inefficiencies which I am addressing.




Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Randall Sawyer


On 03/17/2016 09:30 AM, Matthias Clasen wrote:

> Hi Randall,
>
> thanks for contributing!

Pleased to be of service! Looking forward to learning how folks work 
together in this community.

> I believe that you haven't found such a proposal because most people
> don't see much use in a separate boxed type for utf8 strings. Every
> string we pass around in GLib and GTK+, and every char * in their APIs
> is expected to be in utf8. The few exceptions to this rule are
> explicitly documented.

There already is GString. It dynamically allocates its contents while 
keeping track of the number of bytes required - but not for the number 
of characters it contains.

> The main reason you mention for wanting such a type is to do away with
> the need for repeatedly calculating the character count. I think this
> falls into the same category as the length of the string in bytes - C
> doesn't have counted strings either, and expects you to just call
> strlen() over and over again. In practice, most strings we're handling
> are short enough for this to not be much of an issue.

For interactive applications which employ text-oriented widgets, there 
is a need to keep track of utf-8 character lengths for rendering 
purposes - text selection, etc. Each time this is called for, code needs 
to be written for such management. Take a look at gtkentrybuffer.c for 
example. I see a call for the provision of core code which handles this 
overhead repeatedly for these sorts of demands.

> GLib already provides a number of utilities for dealing with utf8
> strings in terms of characters, such as g_utf8_strlen,
> g_utf8_substring, g_utf8_find_next/prev_char. We can certainly discuss
> adding to that list, if there are glaring omissions.

As I mentioned above, there is GString with its limitations. My intent 
in presenting the possibility of "G_UTF8String" is to combine the 
dynamic allocation provided by GString while employing in the background 
these very utilities you mention.

Here is the vision: Once raw string data - or gunichar value - has been 
passed and validated into the construction of a "G_UTF8String" 
structure, then contents of two-or-more of these can be easily combined 
without the need for additional measuring or validating.
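
As a rough illustration of that idea, here is a minimal sketch in plain C (standard library only, no GLib). The type `CountedStr` and both functions are hypothetical names invented for this example; they are not the proposed "G_UTF8String" API and not anything GLib provides. The sketch assumes the input is already valid UTF-8 and measures it once at construction, so that appending one counted string to another only adds the two cached counts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: caches the byte length and the
 * code point count. Not a GLib type. */
typedef struct
{
  char  *str;      /* NUL-terminated bytes */
  size_t len;      /* length in bytes, excluding the NUL */
  size_t n_chars;  /* cached number of code points */
} CountedStr;

/* Build from a C string assumed to be valid UTF-8. This is the only
 * place where the bytes are scanned to count code points. */
CountedStr
counted_str_new (const char *s)
{
  CountedStr cs;
  const char *p;

  assert (s != NULL);
  cs.len = strlen (s);
  cs.n_chars = 0;
  for (p = s; *p != '\0'; p++)
    if (((unsigned char) *p & 0xC0) != 0x80)  /* not a continuation byte */
      cs.n_chars++;
  cs.str = malloc (cs.len + 1);
  assert (cs.str != NULL);
  memcpy (cs.str, s, cs.len + 1);
  return cs;
}

/* Append src to dst (they must be distinct objects). Both counts are
 * summed; neither string is re-scanned or re-validated. */
void
counted_str_append (CountedStr *dst, const CountedStr *src)
{
  char *grown = realloc (dst->str, dst->len + src->len + 1);

  assert (grown != NULL);
  memcpy (grown + dst->len, src->str, src->len + 1);
  dst->str = grown;
  dst->len += src->len;
  dst->n_chars += src->n_chars;
}
```

The trade-off is that every operation which changes the bytes has to keep `n_chars` in step, and the cached count is only trustworthy as long as nothing writes to `str` directly.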


I have cloned a copy of glib-2.47.92. I am currently documenting the 
source code I have written.


I'll let you know when I have posted my first patch.




Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Matthias Clasen
On Wed, Mar 16, 2016 at 6:58 PM, Randall Sawyer wrote:
> I have a question at the end of this! Please answer if you think it will
> help.

Hi Randall,

thanks for contributing!

>
> I propose the development of a new boxed type for the Glib API named
> "G_UTF8String". I have searched through this mailing list's archives to see
> if anyone else has proposed anything similar. I have not found any.

I believe that you haven't found such a proposal because most people
don't see much use in a separate boxed type for utf8 strings. Every
string we pass around in GLib and GTK+, and every char * in their APIs
is expected to be in utf8. The few exceptions to this rule are
explicitly documented.

The main reason you mention for wanting such a type is to do away with
the need for repeatedly calculating the character count. I think this
falls into the same category as the length of the string in bytes - C
doesn't have counted strings either, and expects you to just call
strlen() over and over again. In practice, most strings we're handling
are short enough for this to not be much of an issue.

GLib already provides a number of utilities for dealing with utf8
strings in terms of characters, such as g_utf8_strlen,
g_utf8_substring, g_utf8_find_next/prev_char. We can certainly discuss
adding to that list, if there are glaring omissions.
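
To make the comparison with strlen() concrete, here is a minimal sketch in plain C (standard library only, no GLib) of what counting characters in a UTF-8 string involves. It assumes the input is already valid UTF-8. The name `sketch_utf8_strlen` is made up for illustration; this is not the GLib implementation of g_utf8_strlen().

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated string that is assumed to be
 * valid UTF-8. Every byte that is not a continuation byte (bit pattern
 * 10xxxxxx) starts a new code point. */
size_t
sketch_utf8_strlen (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}
```

For the six bytes of "h\xC3\xA9llo" (the word "héllo"), strlen() reports 6 and this function reports 5. Both are a single linear pass over the bytes, which is why the cost is in the same category.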


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Errol van de l'Isle
Just to add my two cents worth as a user of glibmm.

Glib::ustring uses g_utf8_pointer_to_offset() to obtain the length of
the string in characters in the method Glib::ustring::length. The
method Glib::ustring::bytes returns the length in bytes.

At no point does it store the number of UTF-8 characters as this would
be inefficient.

Simple string manipulation like inserting a string or character or
concatenating would require extra work to be done. The string needs to
be checked that it is still valid UTF-8 before the length is updated.
The next issue is what to do when the string becomes invalid UTF-8.
Doing this for every string operation will have a performance
implication. Imagine doing this in a loop inserting a byte from a
stream!

Checking at the end of all the operations or handing it over to GTK to
deal with the problems will be more efficient and less of a headache.

On Fri, 2016-03-18 at 10:19 -0400, Randall Sawyer wrote:
> On 03/18/2016 10:10 AM, Florian Müllner wrote:
> > On Fri, Mar 18, 2016 at 2:57 PM Randall Sawyer wrote:
> > > how about the following modifications?
> > > Change "gstring.h":
> > > ...
> > > struct _GString
> > > {
> > >    gchar  *str;
> > >    gsize len;
> > >    gsize allocated_len;
> > >    gsize utf8_len;
> > > };
> > > ...
> > > 
> >  Changing the size of a public struct is an ABI break, so this is
> > not an option for glib-2.x.
>  
> So, does that answer question 4?
> 
> Also - I just discovered that glibmm has a class Glib::ustring
> (https://developer.gnome.org/glibmm/stable/classGlib_1_1ustring.html).
> I am going to take a look through its source to see what they have there.
> 


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Jasper St. Pierre
The major issue is that "Unicode character" doesn't have a good
definition. The most likely definition is a "Unicode code point".
However, Windows uses "Unicode character" to mean a UTF-16 code
unit, which means that any code point above the Basic Multilingual
Plane is really composed of two "Unicode characters", which are, of
course, surrogate pairs.

This confusion also extends to JavaScript, which composes its String
type of "characters" which are actually UTF-16 code units. You can see
this with astral plane characters like emoji:

> "💩".length
2
> "💩" == "\uD83D\uDCA9"
true

As an example of a grapheme cluster without a precomposed,
single-code-point form, look at the Regional Indicators, which were
the politics-free way to add flag symbols to the Emoji block. There
are 26 code points, "A" through "Z", and when put next to each other
in language codes, like "", it's expected that certain
combinations will show up as flags, without explicitly defining which
one. But a sequence of regional indicator code points is entirely one
grapheme cluster.

Go drops the term "character" or "code point" entirely and opts for
"rune" instead, which is just a 32-bit value.

Swift has an even crazier "Character" type [0], which can hold an
entire Grapheme Cluster, rather than just a single code-point. This
actually means that Swift's "Character" type is of potentially
infinite length, since Regional Indicators aren't capped at a maximum
of two code points.

Unicode is fun.

[0] 
https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285

On Thu, Mar 17, 2016 at 12:42 PM, Matthias Clasen wrote:
> On Thu, Mar 17, 2016 at 2:26 PM, Jasper St. Pierre wrote:
>
>> I'll also ask what "character" means in this case, even though I know
>> glib also has the same confusion. Are you talking about the number of
>> Unicode code points in the string, or the number of grapheme clusters,
>> as defined by Unicode TR29 [0]? The number of code points isn't useful
>> for editing in all cases, even after NFC normalization. Some grapheme
>> clusters just don't have a single code-point representation.
>
> I don't think there is any confusion in glib about this, really.
> There is no mention of graphemes in GLib at all, it's all just
> characters. If you want graphemes, you need pango.



-- 
  Jasper


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Matthias Clasen
On Thu, Mar 17, 2016 at 2:26 PM, Jasper St. Pierre wrote:

> I'll also ask what "character" means in this case, even though I know
> glib also has the same confusion. Are you talking about the number of
> Unicode code points in the string, or the number of grapheme clusters,
> as defined by Unicode TR29 [0]? The number of code points isn't useful
> for editing in all cases, even after NFC normalization. Some grapheme
> clusters just don't have a single code-point representation.

I don't think there is any confusion in glib about this, really.
There is no mention of graphemes in GLib at all, it's all just
characters. If you want graphemes, you need pango.


Re: G_UTF8String: Boxed Type Proposal

2016-03-19 Thread Jasper St. Pierre
I'll also ask what "character" means in this case, even though I know
glib also has the same confusion. Are you talking about the number of
Unicode code points in the string, or the number of grapheme clusters,
as defined by Unicode TR29 [0]? The number of code points isn't useful
for editing in all cases, even after NFC normalization. Some grapheme
clusters just don't have a single code-point representation.

[0] http://unicode.org/reports/tr29/

On Thu, Mar 17, 2016 at 11:18 AM, Randall Sawyer wrote:
> On 03/17/2016 10:39 AM, Randall Sawyer wrote:
>>
>>
>> On 03/17/2016 09:30 AM, Matthias Clasen wrote:

>>> I believe that you haven't found such a proposal because most people
>>> don't see much use in a separate boxed type for utf8 strings. Every
>>> string we pass around in GLib and GTK+, and every char * in their APIs
>>> is expected to be in utf8. The few exceptions to this rule are
>>> explicitly documented.
>>>
>>> GLib already provides a number of utilities for dealing with utf8
>>> strings in terms of characters, such as g_utf8_strlen,
>>> g_utf8_substring, g_utf8_find_next/prev_char. We can certainly discuss
>>> adding to that list, if there are glaring omissions.
>>
>> Here is the vision: Once raw string data - or gunichar value - has been
>> passed and validated into the construction of a "G_UTF8String" structure,
>> then contents of two-or-more of these can be easily combined without the
>> need for additional measuring or validating.
>
>
> Alright Matthias, after your thoughtful response, I have come to the
> following conclusion:  When considering management of dynamically allocated
> UTF-8 strings, there are actually two points to consider: 1) Whether the
> byte sequences are valid per IETF RFC 3629 Section 4 - and - 2) The number
> of distinct characters represented in the string vs. the total number of
> bytes used to represent these.
>
> If someone were to write a widget library or an application using libraries
> which ensure valid UTF-8 as input - Gdk key-press events and GtkIMContexts
> for example - then it wouldn't make sense to run those strings through yet
> another course of validation. That addresses the first issue.
>
> There is still the question of character length vs. byte length.
>
> Therefore - from what you have told me - I will be sure to present methods
> which feature validation as an option and not as the rule.
>
> Thank you.
>
>
>



-- 
  Jasper


Re: G_UTF8String: Boxed Type Proposal

2016-03-18 Thread Randall Sawyer

On 03/18/2016 10:10 AM, Florian Müllner wrote:
> On Fri, Mar 18, 2016 at 2:57 PM Randall Sawyer wrote:
> > how about the following modifications?
> > Change "gstring.h":
> > ...
> > struct _GString
> > {
> >    gchar  *str;
> >    gsize len;
> >    gsize allocated_len;
> >    gsize utf8_len;
> > };
> > ...
>
> Changing the size of a public struct is an ABI break, so this is not
> an option for glib-2.x.


So, does that answer question 4?

Also - I just discovered that glibmm has a class Glib::ustring 
(https://developer.gnome.org/glibmm/stable/classGlib_1_1ustring.html). I 
am going to take a look through its source to see what they have there.




Re: G_UTF8String: Boxed Type Proposal

2016-03-18 Thread Randall Sawyer

On 03/17/2016 02:26 PM, Jasper St. Pierre wrote:

> I'll also ask what "character" means in this case, even though I know
> glib also has the same confusion. Are you talking about the number of
> Unicode code points in the string, or the number of grapheme clusters,
> as defined by Unicode TR29 [0]? The number of code points isn't useful
> for editing in all cases, even after NFC normalization. Some grapheme
> clusters just don't have a single code-point representation.
>
> [0] http://unicode.org/reports/tr29/



Good question. Thank you, Jasper.

I just took a look at TR29. The examples in Table 1a, Sample
Grapheme Clusters [1], are to me immediately illustrative of how multiple
code points may be combined into a distinct grapheme ("character"?).


As I delve into Unicode, a hierarchy of order of eight-bit strings is 
emerging in my mind:


Bytes [Low level]: Strings of binary octets - typically terminated by
the null byte 0x00. The number of bytes defines the "length" of the
string. This is the level currently served well by glib's GString structure.


Code Points [Middle level]: Sequences of 1 to 6 bytes - each either
undefined or serving as a packet to deliver a unique code point. The
number of code points defines the "length" of the string. This is the level
at which I am proposing that "G_UTF8String" - or something like it -
will serve developers well.


Graphemes [High level]: Sequences of one or more code points - each
serving as a packet to deliver a unique grapheme. In this case, the
number of graphemes defines the "length" of the string. This level
can be best served with a strong middle level supporting it.


I am developing structures and methods to "Manage Strings of UTF-8 
Encoded Unicode Code Points". Middle level. Henceforth, I will refine my 
terminology - dropping entirely the term "character" as used in glib et 
al documentation - and adopting "utf8 code point" in its place.


[Geographically speaking, as a North American, it is easy to slip into
lazy provincial thought and to miss these distinctions. It might serve
us all better if programming languages with a "char" type were to rename
it "byte". Likewise, instead of "gchar" and "guchar", glib may adopt
"gbyte" and "gubyte".]


[1] http://unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters



Re: G_UTF8String: Boxed Type Proposal

2016-03-18 Thread Randall Sawyer

On 03/17/2016 07:23 PM, Matthias Clasen wrote:

> Sure, code point works too. Anyway, enough with the ontology, we're
> not a standards body
>
> I still don't think that we need a utf8-string datatype.


I have questions, then.

Here are excerpts from the current master files:
"gstring.h"
...
struct _GString
{
  gchar  *str;
  gsize len;
  gsize allocated_len;
};
...

"gstring.c"

...
/**
 * g_string_insert_len:
 * @string: a #GString
 * @pos: position in @string where insertion should
 *   happen, or -1 for at the end
 * @val: bytes to insert
 * @len: number of bytes of @val to insert
 *
 * Inserts @len bytes of @val into @string at @pos.
 * Because @len is provided, @val may contain embedded
 * nuls and need not be nul-terminated. If @pos is -1,
 * bytes are inserted at the end of the string.
 *
 * Since this function does not stop at nul bytes, it is
 * the caller's responsibility to ensure that @val has at
 * least @len addressable bytes.
 *
 * Returns: (transfer none): @string
 */
GString *
g_string_insert_len (GString     *string,
                     gssize       pos,
                     const gchar *val,
                     gssize       len)
...
/**
 * g_string_insert_unichar:
 * @string: a #GString
 * @pos: the position at which to insert character, or -1
 * to append at the end of the string
 * @wc: a Unicode character
 *
 * Converts a Unicode character into UTF-8, and insert it
 * into the string at the given position.
 *
 * Returns: (transfer none): @string
 */
GString *
g_string_insert_unichar (GString  *string,
                         gssize    pos,
                         gunichar  wc)
...

1) Since GString handles insertion of both raw strings and gunichar
   values, it is safe to assume that the raw strings are treated as UTF-8.
   In that case, does the value of the argument `pos' refer to a C array
   index or to a UTF-8 offset? [I had to read the source code to find out.]
2) If the former is true - which it is - then the developer will need to
   call g_utf8_strlen() to determine if there are multi-byte sequences to
   navigate - and if there are - g_utf8_offset_to_pointer() to locate the
   array index. Doesn't this increase processing demand?
3) Wouldn't it be helpful to keep track of how many code points
   ("characters") are stored in the GString - a number which may be less
   than the value of GString.len - without needing to call g_utf8_strlen()
   each time to find out?
4) Would my efforts be better spent editing patches of "gstring.h" and
   "gstring.c" - or - to proceed as I am to introduce a parallel alternative?

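To illustrate the difference between a byte index and a code point offset raised in questions 1 and 2, here is a minimal sketch in plain C (standard library only). It is a stand-in written for this example, under the assumption that the input is valid UTF-8; the name `sketch_offset_to_index` is hypothetical and this is not how g_utf8_offset_to_pointer() is implemented in GLib.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte index at which code point number `offset` starts in
 * `s` (assumed valid UTF-8, NUL-terminated). If `offset` is past the
 * end, the index of the terminating NUL is returned. */
size_t
sketch_offset_to_index (const char *s, size_t offset)
{
  size_t i = 0;

  assert (s != NULL);
  while (offset > 0 && s[i] != '\0')
    {
      i++;                                    /* step over the lead byte */
      while (((unsigned char) s[i] & 0xC0) == 0x80)
        i++;                                  /* and its continuation bytes */
      offset--;
    }
  return i;
}
```

For the bytes "a\xC3\xA9" "b\xE2\x82\xAC" (the text "aéb€"), offsets 0, 1, 2 and 3 map to byte indexes 0, 1, 3 and 4. The walk is linear in the offset, which is the processing demand question 2 refers to.
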

If the answer to (4) is yes, then how about the following modifications?
Change "gstring.h":
...
struct _GString
{
  gchar  *str;
  gsize len;
  gsize allocated_len;
  gsize utf8_len;
};
...

Add to "gstring.h":
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_truncate_utf8       (GString      *string,
                                       gsize         utf8_len);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_len_utf8     (GString      *string,
                                       gssize        offset,
                                       const gchar  *val,
                                       gssize        utf8_len);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_utf8         (GString      *string,
                                       gssize        offset,
                                       const gchar  *val);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_c_utf8       (GString      *string,
                                       gssize        offset,
                                       gchar         c);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_insert_unichar_utf8 (GString      *string,
                                       gssize        offset,
                                       gunichar      wc);
...
GLIB_AVAILABLE_IN_2_XX
GString* g_string_overwrite_utf8      (GString      *string,
                                       gssize        offset,
                                       const gchar  *val);
GLIB_AVAILABLE_IN_2_XX
GString* g_string_overwrite_len_utf8  (GString      *string,
                                       gssize        offset,
                                       const gchar  *val,
                                       gssize        utf8_len);

Add to "utf8.c":
...
GLIB_AVAILABLE_IN_2_XX
void   g_utf8_measure                 (const gchar  *utf8,
                                       glong         max_len,
                                       gsize        *utf8_len,
                                       gsize        *byte_len,
                                       gboolean      validate);
GLIB_AVAILABLE_IN_2_XX
gchar* g_utf8_sized_offset_to_pointer (const gchar  *utf8,
                                       glong         offset,
                                       gsize         utf8_len,
                                       gsize         byte_len);
...
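
A single-pass measurement of this kind might look roughly like the sketch below, written in plain C with the standard library only. It is a simplified illustration and not the proposed g_utf8_measure(): the `max_len` and `validate` parameters are left out, the input is assumed to be valid UTF-8, and both function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Walk the string once and report both the code point count and the
 * byte count. Input is assumed to be valid, NUL-terminated UTF-8. */
void
sketch_utf8_measure (const char *s, size_t *n_chars, size_t *n_bytes)
{
  size_t chars = 0, bytes = 0;

  assert (s != NULL && n_chars != NULL && n_bytes != NULL);
  for (; s[bytes] != '\0'; bytes++)
    if (((unsigned char) s[bytes] & 0xC0) != 0x80)  /* lead or ASCII byte */
      chars++;
  *n_chars = chars;
  *n_bytes = bytes;
}

/* Equal counts mean every code point took exactly one byte (pure
 * ASCII), so a code point offset can be used directly as a byte index. */
bool
sketch_offset_is_index (size_t n_chars, size_t n_bytes)
{
  return n_chars == n_bytes;
}
```

For "plain" both counts are 5, so offsets and indexes coincide. For "na\xC3\xAF" "ve" (the word "naïve") the counts are 5 code points and 6 bytes, so a lookup from offset to index is still needed.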

Note 1: The GString functions ending in *_utf8 would check if values of 
GString.len and GString.utf8_len are equal - and directly access 
contained gchar array if they are, thus dispensing with looking up 
pointer from offset.
Note 2: The function g_utf8_measure() iterates the passed array once, 
simultaneously arriving at the 

Re: G_UTF8String: Boxed Type Proposal

2016-03-18 Thread Chris Vine
On Fri, 18 Mar 2016 10:19:08 -0400
Randall Sawyer wrote:
> Also - I just discovered that glibmm has a class Glib::ustring 
> (https://developer.gnome.org/glibmm/stable/classGlib_1_1ustring.html).
> I am going to take a look through its source to see what they have
> there.

It does, although I have stopped using it except for cases where the
API demands it, in favour of std::string (the equivalent for C being
char*).

Sure, Glib::ustring tells you the number of unicode code points in the
string, but so what?  That knowledge is mostly useless.  When people
refer to a "character" they normally really mean a grapheme rather than
a code point.  However with combining characters, Hangul jamos, Indic
consonant clusters and other grapheme clusters, any given "character"
in this sense can require more than one code point and cannot be
displayed correctly without them.  Knowing the number of code points
doesn't help you.

Chris


Re: G_UTF8String: Boxed Type Proposal

2016-03-18 Thread Randall Sawyer

On 03/17/2016 10:39 AM, Randall Sawyer wrote:
>
> On 03/17/2016 09:30 AM, Matthias Clasen wrote:
>> I believe that you haven't found such a proposal because most people
>> don't see much use in a separate boxed type for utf8 strings. Every
>> string we pass around in GLib and GTK+, and every char * in their APIs
>> is expected to be in utf8. The few exceptions to this rule are
>> explicitly documented.
>>
>> GLib already provides a number of utilities for dealing with utf8
>> strings in terms of characters, such as g_utf8_strlen,
>> g_utf8_substring, g_utf8_find_next/prev_char. We can certainly discuss
>> adding to that list, if there are glaring omissions.
>
> Here is the vision: Once raw string data - or gunichar value - has
> been passed and validated into the construction of a "G_UTF8String"
> structure, then contents of two-or-more of these can be easily
> combined without the need for additional measuring or validating.


Alright Matthias, after your thoughtful response, I have come to the 
following conclusion:  When considering management of dynamically 
allocated UTF-8 strings, there are actually two points to consider: 1) 
Whether the byte sequences are valid per IETF RFC 3629 Section 4 - and - 
2) The number of distinct characters represented in the string vs. the 
total number of bytes used to represent these.


If someone were to write a widget library or an application using 
libraries which ensure valid UTF-8 as input - Gdk key-press events and 
GtkIMContexts for example - then it wouldn't make sense to run those 
strings through yet another course of validation. That addresses the 
first issue.


There is still the question of character length vs. byte length.

Therefore - from what you have told me - I will be sure to present 
methods which feature validation as an option and not as the rule.


Thank you.

