I use the Gmail web interface, which is not great. I'll just comment
without quoting.

The thing I'm trying to address is the fact that all CF objects must start
with:
struct {
        void *isa;
        uint32_t info;
};
That 32-bit info value includes the CFTypeID (a 16-bit value) and 16 bits
for general/restricted use.
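
To make the split concrete, here is a minimal sketch of packing and unpacking
that info word. The bit positions are an assumption for illustration only
(type ID in the low 16 bits, general-use bits in the high 16); the names
cf_base, cf_info_make, cf_info_type_id and cf_info_flags are hypothetical
and are not taken from corebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Common header every CF object starts with, as described above. */
struct cf_base {
        void *isa;
        uint32_t info;
};

/* info is expected to sit directly after isa, with no padding between. */
static_assert(offsetof(struct cf_base, info) == sizeof(void *),
              "info must follow isa directly");

/* ASSUMPTION: type ID in the low 16 bits, general-use bits in the high 16. */
#define CF_INFO_TYPEID_MASK 0xFFFFu
#define CF_INFO_FLAGS_SHIFT 16

/* Build an info word from a 16-bit type ID and 16 general-use bits. */
static inline uint32_t cf_info_make(uint16_t type_id, uint16_t flags)
{
        return ((uint32_t)flags << CF_INFO_FLAGS_SHIFT) | type_id;
}

/* Extract the 16-bit type ID. */
static inline uint16_t cf_info_type_id(uint32_t info)
{
        return (uint16_t)(info & CF_INFO_TYPEID_MASK);
}

/* Extract the 16 general-use bits. */
static inline uint16_t cf_info_flags(uint32_t info)
{
        return (uint16_t)(info >> CF_INFO_FLAGS_SHIFT);
}
```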

If that 32-bit (or it could be 64-bit) field could be the same for constant
strings, it would allow CFString functions to work directly with ObjC
constant strings, instead of having to call the toll-free bridging
mechanism. That would be much more efficient for container objects in
corebase.

Just to be clear, the CFString structure is currently:
struct {
        void *isa;
        uint32_t info;
        char *data;
        long count;
        long hash;
        void *allocator;
};

If the ObjC constant string structure and the CFString structure were
similar, they could be used interchangeably in corebase and base.

So my proposal was to arrange the leading portion of the new
constant string structure as:
struct {
        void *isa;
        uint64_t info; /* includes both info and hash */
        char *data;
        long count;
};

If I modified the corebase version to match, these structures, with a little
help from libobjc, could be exactly the same.
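
As a rough sketch of what "exactly the same" would mean in practice, the
shared prefix can be checked at compile time. The struct names and the
trailing allocator field on the corebase side are hypothetical, and the
split of the 64-bit info (low 32 bits for the CF info word, high 32 bits
for the hash) is an assumption, not something either library does today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Proposed constant string layout, as written above. */
struct const_string {
        void *isa;
        uint64_t info;  /* includes both info and hash */
        char *data;
        long count;
};

/* HYPOTHETICAL corebase layout after being modified to match. */
struct cf_string {
        void *isa;
        uint64_t info;
        char *data;
        long count;
        void *allocator;  /* corebase-only field, after the shared prefix */
};

/* The shared prefix must line up field for field. */
static_assert(offsetof(struct const_string, info) ==
              offsetof(struct cf_string, info), "info offset differs");
static_assert(offsetof(struct const_string, data) ==
              offsetof(struct cf_string, data), "data offset differs");
static_assert(offsetof(struct const_string, count) ==
              offsetof(struct cf_string, count), "count offset differs");

/* ASSUMPTION: low 32 bits are the CF info word, high 32 bits the hash. */
static inline uint64_t info_pack(uint32_t word, uint32_t hash)
{
        return ((uint64_t)hash << 32) | word;
}

static inline uint32_t info_word(uint64_t info)
{
        return (uint32_t)(info & 0xFFFFFFFFu);
}

static inline uint32_t info_hash(uint64_t info)
{
        return (uint32_t)(info >> 32);
}
```

Note that the sketch always goes through the full uint64_t. Reading only the
first four bytes in memory as the info word would give different halves
depending on byte order.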

On Thu, Apr 5, 2018 at 3:33 PM, David Chisnall <gnus...@theravensnest.org>
wrote:

> This might be slightly confusing, because your mail client doesn’t seem to
> do anything sane for quoting:
>
> On 5 Apr 2018, at 20:09, Stefan Bidigaray <stefanb...@gmail.com> wrote:
> >
> > On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <
> gnus...@theravensnest.org> wrote:
> > On 5 Apr 2018, at 17:27, Stefan Bidigaray <stefanb...@gmail.com> wrote:
> > >
> > > Hi David,
> > > I forgot to make a comment when you originally posted the idea, and I
> think this would be a great time to add my 2 cents.
> > >
> > > Regarding the structure:
> > > * Would it not be better to add the flags bit field immediately after
> the isa pointer? My thought here is that it can be checked for if different
> versions of the structure exist. This is important for CoreBase since it
> does not have the luxury of real classes.
> >
> > I’m concerned with structure padding here.  Even on a 64-bit platform,
> we either need an 8-byte flags field (which is wasteful) or end up with 4
> bytes of padding.  With 128-bit pointers (which are probably coming sooner
> than you expect) we will end up with 12 bytes of padding if we have a
> 32-bit flags field followed by a pointer.
> >
> > Well, I was hoping there is a way we can define this structure so that
> it can be used directly in CoreBase, without having to call the toll-free
> bridging mechanism. If a 32-bit hash is used, could it be combined with the
> "flags" variable (see the structure I included at the end of this email)?
> I'm hoping to be able to use the same constant strings without having
> to call the bridging mechanism. It's pretty slow and cumbersome.
>
> Can you explain why CoreBase needs to store the hash as anything other
> than a 32-bit value that it can zero extend when returning a 64-bit value?
> If the CoreFoundation and Foundation implementations of hash are
> compatible, then it will currently be returning a 28-bit value in a 64-bit
> register, so I don’t understand the issue here.
>
> >
> > By the way, I noticed there was no uint32_t flags in your original
> structure, making it 24 bytes in 32-bit CPUs.
> >
> > > * Would it be possible to make the hash variable an NSUInteger? The
> output of -hash is an NSUInteger, and that would allow the value to be
> expanded in the future.
> >
> > We can, though that would again increase the size quite noticeably.  I
> think I’m happy with a 32-bit hash, because as rfm points out with a decent
> hash algorithm that basically gives us unique hashes.
> >
> > Sounds reasonable.
> >
> > > * Why have both count and length? Would it not make more sense to keep
> a single variable here called count and define it as, "The count/number of
> code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16
> it would be the # of 16-bit codes. The Apple documentation states "The
> number of UTF-16 code units in the receiver", making at least the ASCII and
> UTF-16 numbers correct. The way I understand the current implementation,
> the value for length would return the UTF-32 # of characters, which is
> inconsistent with the docs.
> >
> > If a UTF-8 string contains multi-byte sequences, then the length of the
> buffer and the number of UTF-16 code units will be different.  If we know
> the number of bytes, then we can use more efficient C standard library
> functions for things like comparisons, though that may not be important.
> >
> > I guess I'm still a bit confused about the meaning and/or difference of
> the variables count and length.
>
> One tells you the logical number of characters, the other the length of
> the buffer in bytes.  A lot of bytes-scanning functions are far more
> efficient if they know the length up front, because they can then process
> one word at a time until the last word.
>
> > I know this is probably going to be rejected, but how about making
> constant string either ASCII or UTF-16 only? Scratching UTF-8 altogether? I
> know this would increase the byte count for most European languages using
> Latin characters, but I don't see the point of maintaining both UTF-8 and
> UTF-16 encoding. Everything that can be done with UTF-16 can be encoded in
> UTF-8 (and vice versa), so how would the compiler pick between the two?
> Additionally, wouldn't sticking to just 1 of the 2 encodings simplify the
> code significantly?
>
> There’s also the issue that -UTF8String is one of the most commonly used
> methods on NSString, so if we represent something as UTF-16 internally then
> it needs converting and returning in an autoreleased buffer, whereas with a
> UTF-8 string it can just return the pointer.  On non-Windows platforms,
> -UTF8String is the way of getting a string that you pass to pretty much any
> OS function.
>
> >
> > > * I would also think that it makes more sense to have the length/count
> variable before the data pointer. I don't have a strong opinion about this
> one, but it just makes more sense in my head.
> >
> > Again, this gives us more padding in the structure.
> >
> > Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and 64-bit
> architectures (except WIN64)? I thought a long would not be padded any more
> than a pointer for most applications.
>
> Not Win64, not on anything with larger than 64-bit pointers.
>
> > >
> > > Regarding the hash function:
> > > Why are we using Murmur3 hash? I know it is significantly more
> efficient than our current one-at-a-time approach, but how much better is
> it than competing hash functions? Is there a benchmark out there comparing
> some of the major ones? For example, how does it compare with lookup3 or
> SpookyHash. If we are storing the hash in the string structure, the speed
> of calculating the hash is not as important as the spread. Additionally,
> Murmur3 seems ill suited if NSUInteger is used to store the hash value
> since, as far as I could tell, it only outputs 32-bit and 128-bit hashes.
> Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words
> in the case of lookup3), as well.
> >
> > The size of the type doesn’t necessarily give us the range.  We are
> completely free to give only a 32-bit or even 28-bit range within an
> NSUInteger (which is what we do now), as long as we have good coverage.  A good
> hash function has even distribution of entropy across all bits, so taking a
> 32-bit or 128-bit hash and truncating it is fine.  That said, I’m happy to
> make the hash value 8 bytes on 64-bit platforms if this seems like a good
> use of bits.
> >
> > I’m not wedded to the idea of Murmur3.  We do need to use the same hash
> for constant and non-constant strings, so execution speed is important.
> I’m somewhat tempted to suggest SHA256, because it’s fairly easy to
> accelerate with SSE and newer CPUs have full hardware offload for it.  That
> said, the goal is not to mandate the use of the compiler-generated hash for
> constant strings, it’s to provide a space to store one that the compiler
> initialises to something sensible.
> >
> > Given the analysis I’ve done in the reply to Ivan, I think it’s worth
> consuming space to improve performance.
> >
> > I agree.
> >
> > So how about a structure like:
> >
> > struct {
> >         id isa; /* Class pointer */
> >         uint64_t flags;
> >         /* Flags bitfield:
> >            Low 2 bits, enum with values:
> >            0: ASCII string
> >            1: UTF-16 string
> >            2 and 3: Reserved for future encodings
> >            (1<<2) to (1<<3): 0 for one-at-a-time; 1 for murmur hash; 2
> and 3 reserved for future hashes
> >            (1<<4) to (1<<15): Reserved for future compiler-defined flags
> >            (1<<16) to (1<<31): Reserved for use by the constant string
> class (I'm hoping this could hold the CFTypeID of a constant string so it
> can be identified by corebase)
> >            (1<<32) to (1<<63): hash
> >         */
> >         const char *data; /* Pointer to the buffer.  ro_data section, so
> immutable.  NULL-terminated */
> >         long count;  /* Number of UTF-16 code units, not including the
> null terminator */
> > }
>
> I don’t see why we’d use a single uint64_t rather than a pair of uint32_ts
> and I don’t like the ordering (it will be annoying to have to order the
> fields differently on 128-bit pointer platforms).  I’m not convinced that
> it’s worth omitting the length to save 8 bytes per string.  It’s probably
> also not actually worth using longs for the length on 64-bit platforms, so
> both of these should probably be 32 bits.  4GB of string literal seems a
> bit excessive (for one thing, I doubt the compiler will be entirely happy
> with it, and I don’t know how happy linkers are with 4GB symbols…).
>
> David
>
>
_______________________________________________
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev
