Vincent,
I put this in, but was getting exceptions because the array overflowed in the
Lucene.Net.Misc.Util.Fst.TestFSTsMisc.TestRandomWords() test.
I ended up changing it since it was highly optimized for non-surrogate pairs,
but if using surrogate pairs it was copying arrays over and over to resize the
array on every iteration in a tight loop. So, to compromise, we just double the
count in cases where it is 1024 chars or less and do a loop to count the actual
chars to set the array size exactly if it is over that. Running all of the
tests in Analysis.Common used to take 14-16 minutes, now it takes 10-11
minutes. Now it looks like this:
public static char[] ToCharArray(int[] codePoints, int offset, int
count)
{
if (count < 0)
{
throw new System.ArgumentException();
}
int countThreashhold = 1024; // If the number of chars exceeds
this, we count them instead of allocating count * 2
// LUCENENET: as a first approximation, assume each codepoint
// is 2 characters (since it cannot be longer than this)
int arrayLength = count * 2;
// LUCENENET: if we go over the threashhold, count the number of
// chars we will need so we can allocate the precise amount of
memory
if (count > countThreashhold)
{
arrayLength = 0;
for (int r = offset, e = offset + count; r < e; ++r)
{
arrayLength += codePoints[r] < 0x010000 ? 1 : 2;
}
if (arrayLength < 1)
{
arrayLength = count * 2;
}
}
// Initialize our array to our exact or oversized length.
// It is now safe to assume we have enough space for all of the
characters.
char[] chars = new char[arrayLength];
int w = 0;
for (int r = offset, e = offset + count; r < e; ++r)
{
int cp = codePoints[r];
if (cp < 0 || cp > 0x10ffff)
{
throw new System.ArgumentException();
}
if (cp < 0x010000)
{
chars[w++] = (char)cp;
}
else
{
chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >>
LEAD_SURROGATE_SHIFT_));
chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp &
TRAIL_SURROGATE_MASK_));
}
}
var result = new char[w];
Array.Copy(chars, result, w);
return result;
}
Anyway, if you think we need to adjust the threshold or if you have any other
suggestions, let me know.
Thanks,
Shad Storhaug (NightOwl888)
-----Original Message-----
From: Van Den Berghe, Vincent [mailto:[email protected]]
Sent: Wednesday, March 22, 2017 9:59 PM
To: [email protected]
Subject: Proposal: UnicodeUtils.ToCharArray implementation
When the dust settles, I propose to rewrite UnicodeUtils.ToCharArray as follows:
public static char[] ToCharArray(int[] codePoints, int offset, int
count)
{
if (count < 0)
{
throw new System.ArgumentException();
}
// as a first approximation, assume each codepoint
is 1 character
char[] chars = new char[count];
int w = 0;
for (int r = offset, e = offset + count; r < e; ++r)
{
int cp = codePoints[r];
if (cp < 0 || cp > 0x10ffff)
{
throw new System.ArgumentException();
}
if (cp < 0x010000)
{
chars[w++] = (char)cp;
}
else
{
chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >>
LEAD_SURROGATE_SHIFT_));
// if we need more room,
add enough to hold 2-character codepoints for the rest of
// the string. Normally,
this resize operation will happen at most once.
if (w >= chars.Length)
Array.Resize(ref
chars, chars.Length + (e - r) * 2 - 1);
chars[w++] =
(char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_));
}
}
// resize to the exact length: it's slightly faster
to check if the resize is needed
if (w != chars.Length)
Array.Resize(ref chars, w);
return chars;
}
This avoids exception overhead and at least one heap allocation. If no code
points generate 2 characters, the method doesn't allocate anything extra than
the result array.
Vincent