Nice, this could be very useful: slow UTF-8 encoding/decoding is a potential 
bottleneck.

On 27 Jun 2014, at 19:23, [email protected] wrote:

> Impressive numbers and a nice example to grasp lots of NB/Asm details.
> On 27 June 2014 at 18:55, "Henrik Johansen" <[email protected]> 
> wrote:
> >
> > Soo, I started dabbling with the thing I talked about before last summer, 
> > letting String parameters in NB calls have an encoding: option.
> > (There’s already a slice in inbox to allow optional values other than 
> > true/false)
> >
> > Thought I’d start with decoding; here’s a small preview of the part which 
> > does the actual decoding, after the needed string class has been determined 
> > and instantiated.
> > While it’s a fallback path for when the platform doesn’t support SSE or 
> > other batch operations, it’s still using some neat tricks (imho) I thought 
> > others might enjoy on a Friday afternoon :)
> >
> > emitStandardDecodeUTF8CharactersFrom: aSource to: aDestination 
> > withCharSize: aCharSize scratchReg: scratchReg using: aGenerator
> >         "Emit decoding using only standard x86 ops"
> >         "We have already found what String class is needed for decoding 
> > aSource, and created an instance of the proper size"
> >         "This implementation focuses on minimizing jumps and register 
> > usage, at the cost of loading from source one byte at a time."
> >         "Input:
> >                 aSource - memory pointer to C-string with UTF8 bytes
> >                 aDestination - memoryPointer to first var field of String 
> > instance
> >                 scratchReg - a register which will be modified while 
> > decoding
> >
> >                 aCharSize - The size in bytes of each character in our 
> > destination string, known at emission time
> >
> >         Clobbers: scratchReg
> >         aSource and aDestination will end up pointing to end of strings"
> >
> >         | asm scratch32 sLowByte sHighByte loop done oneByte twoBytes 
> > threeBytes  |
> >
> >         asm := aGenerator asm.
> >         loop := asm uniqueLabelName: 'utf8DecodeLoop'.
> >         done := asm uniqueLabelName: 'utf8DecodingDone'.
> >         scratch32 := scratchReg as32.
> >         sLowByte  := scratch32 as8.
> >         sHighByte := sLowByte asHighByte.
> >
> >         asm label: loop.
> >         "Unroll the inner loop as many times as we want, or, well, at least 
> > as many times as the backwards jump will allow us to"
> >         8 timesRepeat:[
> >         oneByte := asm uniqueLabelName: 'utf8OneByteDecode'.
> >         twoBytes := asm uniqueLabelName: 'utf8TwoByteDecode'.
> >         threeBytes := asm uniqueLabelName: 'utf8ThreeByteDecode'.
> >         asm xor: scratch32 with:scratch32.
> >         asm or: sLowByte with: aSource ptr8.
> >         asm cmp: sLowByte with: 0.
> >         asm je: done.
> >         asm add: aSource with: 1.
> >         asm test: sLowByte with: 2r10000000 asUImm8.
> >         asm jz: oneByte.
> >         "We have a header, place its data bits as initial high byte value"
> >         asm shl: scratch32 with: 8.
> >         asm xor: sHighByte with: 2r11000000 asUImm8. "Strip 2 byte header"
> >         asm test: sHighByte with: 2r00100000.
> >         asm jz: twoBytes.
> >         aCharSize > 1 ifTrue: [
> >         asm xor: sHighByte with: 2r00100000. "Strip 3 byte header"
> >         asm test: sHighByte with: 2r00010000.
> >         asm jz: threeBytes.
> >         "This is a 4-byte character"
> >         asm xor: sHighByte with:2r00010000."Strip 4 byte header"
> >         "Read one trailing byte, remove the header, and shift the data out 
> > of low byte"
> >         asm or: sLowByte with: aSource ptr8.
> >         asm shl: sLowByte with:2.
> >         asm shl: scratch32 with: 6.
> >         asm add: aSource with: 1.
> > asm label: threeBytes.
> >         "Read one trailing byte, remove the header, and shift the data out 
> > of low byte"
> >         asm or: sLowByte with: aSource ptr8.
> >         asm shl: sLowByte with:2.
> >         asm shl: scratch32 with: 6.
> >         asm add: aSource with: 1.
> >         ].
> > asm label: twoBytes.
> >         "Read last trailing byte, remove header, and shift the data bits 
> > into proper place"
> >         asm or: sLowByte with: aSource ptr8.
> >         asm shl: sLowByte with:2.
> >         asm shr: scratch32 with: 2.
> >         asm add: aSource with: 1.
> > asm label: oneByte.
> >         asm mov: (aDestination ptr size: aCharSize) with: (scratch32 as: 
> > aCharSize).
> >         asm add: aDestination with: aCharSize.].
> >         asm jmp: loop.
> >         asm label: done.
> >
> > And the relevant test code for that:
> >
> > testStandardDecodeWide
> >         | bytes string |
> >         "bytes := (ZnUTF8Encoder new encodeString: 'Cash, like €, is 
> > king'), #[0]."
> >         bytes := #[67 97 115 104 44 32 108 105 107 101 32 226 130 172 44 32 
> > 105 115 32 107 105 110 103 0].
> >         string := WideString new: bytes size - 1.
> >         self testStandardDecode: bytes toWideString: string.
> >         ^ string
> >
> > testStandardDecode: utf8Bytes toWideString:aString
> >         <primitive: #primitiveNativeCall module: #NativeBoostPlugin>
> >         ^ self nbCallout
> >                 function: #(void #(char* utf8Bytes,  char* aString ))
> >                 emit: [ :gen :proxy :asm |
> >                         asm pop: asm EBX;
> >                                 pop: asm ECX.
> >                         self emitStandardDecodeUTF8CharactersFrom: asm EBX 
> > to: asm ECX withCharSize: 4 scratchReg: asm EAX using: gen.
> >                         asm mov: EAX with: gen proxy nilObject  ]
> >
> > Which, though it’s currently cheating by knowing the string class/size in 
> > advance, isn’t a lot of overhead:
> > ext := NBExternalString new.
> > [ext testStandardDecodeWide]  bench  '5,030,000 per second.' '5,080,000 per 
> > second.' '5,190,000 per second.'
> > Compared to an equivalent to testStandardDecodeWide, with emitStandard… 
> > removed from the primitive:
> > [ext testEmptyDecode]  bench '5,850,000 per second.' '5,800,000 per 
> > second.' '5,640,000 per second.'
> >
> > … or compared to doing the decoding in image after the call:
> > int := ZnUTF8Encoder new.
> > [int decodeBytes:#[67 97 115 104 44 32 108 105 107 101 32 226 130 172 44 32 
> > 105 115 32 107 105 110 103 0]] bench  '130,000 per second.' '131,000 per 
> > second.' '132,000 per second.'
> >
> > Cheers,
> > Henry
> >
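
For readers who want to follow the bit manipulation without a NativeBoost image at hand, here is a rough C model of the decoding loop quoted above. It is only a sketch, assuming the header tests work as the comments in the method describe (2-, 3- and 4-byte headers) and a 4-byte destination character size. The names utf8_decode_model and push_trailing are made up for illustration and are not part of NativeBoost. A single uint32_t stands in for the scratch register, with bits 8..15 playing the role of the high byte and bits 0..7 the low byte. Like the quoted routine, it does no validation and trusts the input to be well-formed, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models one "or lowByte, [src] / shl lowByte, 2 / shl scratch32, 6" step.
   The 8-bit shift pushes the 10xxxxxx continuation header out of the low
   byte; the 32-bit shift then makes room for the next trailing byte. */
static uint32_t push_trailing(uint32_t scratch, unsigned char byte)
{
    scratch |= (uint8_t)(byte << 2);   /* low byte was zero before the or */
    return scratch << 6;
}

/* Hypothetical helper: decodes NUL-terminated UTF-8 from src into 32-bit
   code points in dst and returns how many were written. dst must be large
   enough; malformed input is not detected. */
size_t utf8_decode_model(const unsigned char *src, uint32_t *dst)
{
    size_t count = 0;
    assert(src != NULL && dst != NULL);
    for (;;) {
        uint32_t scratch = *src;           /* xor scratch, then or in a byte */
        if (scratch == 0)                  /* NUL terminator: done */
            break;
        src++;
        if (scratch & 0x80u) {             /* not plain ASCII */
            scratch <<= 8;                 /* header moves to the high byte */
            scratch ^= 0xC000u;            /* strip 2-byte header bits */
            if (scratch & 0x2000u) {
                scratch ^= 0x2000u;        /* strip 3-byte header bit */
                if (scratch & 0x1000u) {
                    scratch ^= 0x1000u;    /* strip 4-byte header bit */
                    scratch = push_trailing(scratch, *src++);
                }
                scratch = push_trailing(scratch, *src++);
            }
            /* Last trailing byte: drop its header, then shift the whole
               value down so the data bits become contiguous. */
            scratch |= (uint8_t)(*src++ << 2);
            scratch >>= 2;
        }
        dst[count++] = scratch;
    }
    return count;
}
```

Fed the byte array from testStandardDecodeWide, this model produces 21 code points, with U+20AC (the euro sign) at index 11. The fall-through from the 4-byte case into the 3-byte and 2-byte cases mirrors the label layout in the emitted code, which is why each extra trailing byte costs only one more conditional test.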

