> On 29 Jan 2018, at 09:59, Sven Van Caekenberghe <[email protected]> wrote:
>
> Great results, Marcus.
>
>> On 29 Jan 2018, at 09:18, Marcus Denker <[email protected]> wrote:
>>
>> Right now #embeddSourceInTrailer encodes and decodes every method to utf8.
>> This is fairly slow.
>>
>> We do not actually need to use utf8; the only important thing is that we
>> interpret the bits correctly when we decode (wide string or not?).
>> As a first step we can even utf8 encode just the WideStrings, as there are
>> not many in the image.
>
> As a speedup it is certainly a good strategy to encode ByteStrings into
> Latin1 ByteStrings, since this is a no-op. But I would always encode
> WideStrings as UTF-8 since that is a much more efficient, variable length
> encoding. Storing WideStrings as 32-bit characters would be quite wasteful.
>
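A minimal sketch of that scheme, written in Python purely for illustration (it is not the Pharo implementation). The one-byte flag and the names encode_source and decode_source are assumptions made for the example; the thread only says that the decoder has to know whether it is looking at a wide string or not.

```python
# Hypothetical flag values; the real trailer format may mark this differently.
FLAG_LATIN1 = 0  # source fits in one byte per character (ByteString case)
FLAG_UTF8 = 1    # source contains wide characters (WideString case)


def encode_source(source: str) -> bytes:
    """Store byte-sized text as Latin-1 and everything else as UTF-8."""
    try:
        # For text whose code points are all below 256 this is a plain
        # byte-for-byte copy, which is what makes the fast path cheap.
        return bytes([FLAG_LATIN1]) + source.encode("latin-1")
    except UnicodeEncodeError:
        # Only the rare wide sources pay for a variable-length encoding.
        return bytes([FLAG_UTF8]) + source.encode("utf-8")


def decode_source(data: bytes) -> str:
    """Use the leading flag byte to pick the matching decoder."""
    if not data:
        raise ValueError("empty trailer")
    flag, payload = data[0], data[1:]
    if flag == FLAG_LATIN1:
        return payload.decode("latin-1")
    if flag == FLAG_UTF8:
        return payload.decode("utf-8")
    raise ValueError(f"unknown encoding flag: {flag}")
```

Round-tripping any string through encode_source and decode_source should give the original text back; only sources with characters above 255 take the UTF-8 branch.
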
> Intuitively it feels like a simple compression scheme with a shared
> dictionary of a couple of thousand of the most common substrings in method
> source code would be able to compress sources quite a bit. Such compression
> would not break literal searching.
Yes, and for real search speed we could look again into indexing… It should be
possible to build a search index on demand before the first search and
cache it (so it would never be saved in the image and never waste memory in
deployment).
With that we could even get real-time full-text search. That is, it would be
faster than “senders of” is now.
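
To make the on-demand idea concrete, here is a rough sketch in Python (for illustration only, not Pharo code). The class name LazySourceIndex and the choice of a trigram index are assumptions for the example; any index structure would do. The point is only that the index is built on the first search and kept in memory afterwards.

```python
from collections import defaultdict


class LazySourceIndex:
    """Hypothetical full-text index over method sources, built on first use."""

    def __init__(self, sources):
        # sources: dict mapping a method name to its source text
        self._sources = sources
        self._index = None  # nothing is built or stored until a search needs it

    def _build(self):
        # Map every 3-character substring to the methods that contain it.
        index = defaultdict(set)
        for name, text in self._sources.items():
            for i in range(len(text) - 2):
                index[text[i:i + 3]].add(name)
        return index

    def search(self, query):
        """Return the sorted names of all methods whose source contains query."""
        if len(query) < 3:
            # Too short for a trigram lookup, so fall back to a linear scan.
            return sorted(n for n, t in self._sources.items() if query in t)
        if self._index is None:
            self._index = self._build()  # built once, then cached
        candidates = None
        for i in range(len(query) - 2):
            hits = self._index.get(query[i:i + 3], set())
            candidates = hits if candidates is None else candidates & hits
            if not candidates:
                return []
        # Trigram hits can be false positives, so confirm with a real match.
        return sorted(n for n in candidates if query in self._sources[n])
```

Because the index only ever lives in the cached attribute, dropping it before saving (or never searching at all) leaves nothing behind, which is the property wanted for deployment.
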
Marcus