On 27.04.2010 06:25, Xueming Shen wrote:
Ulf Zibis wrote:
On 24.04.2010 01:09, Xueming Shen wrote:

I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (the last version was 122+k), but then I can no longer use cpLen as the capacity for the hashmap. I'm now using a hardcoded 20000 for 5.2.

Again, is 88k the compressed or the uncompressed size?

Yes, it's the size of compressed data.

I'm wondering, as script.txt is only ~120k.

Your smart "save one more byte" suggestion will save 400+ bytes, a tiny 0.5%, unfortunately :-)

I didn't mean the saving in total file footprint, I meant the byte-wise read() count: my code only needs to read 1 int per character block instead of 1 byte + 1 int, which also looks kind of ugly.
Anyway, the theoretical maximum win would be < 20 %.
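
To make that concrete, a minimal sketch of the two record layouts I am comparing; the packed format (name length in the high byte of the offset int) is my assumption for illustration, not the actual uniName.dat layout:

    import java.io.DataInputStream;
    import java.io.IOException;

    class RecordLayoutSketch {
        // Packed variant: one 4-byte read per character block.
        // Assumed layout: high byte = name length, low 3 bytes = offset.
        static int[] readPacked(DataInputStream dis) throws IOException {
            int packed = dis.readInt();
            return new int[] { packed >>> 24, packed & 0xFFFFFF };
        }

        // Current variant: 1 byte + 1 int = 5 bytes and two reads per block.
        static int[] readPlain(DataInputStream dis) throws IOException {
            return new int[] { dis.readUnsignedByte(), dis.readInt() };
        }
    }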



-- Is it faster to first copy the whole data into a byte[] and then use ByteBuffer.getInt() etc., compared to directly using DataInputStream methods?
The current impl uses neither ByteBuffer nor DataInputStream, so there is no comparison to make here.
If JIT-compiled, bb.get() should be as fast as ba[cpOff++] & 0xff.
My comparison is about the manual byte-to-int assembling + the triple buffering of the data (getResourceAsStream() returns a buffered stream, and I believe InflaterInputStream is buffered too).

Yes, using DataInputStream will definitely make the code look better (no more of those "ugly" shifts), but it will also slow things down a little, since it adds one more layer. But speed may not really be a concern here.

On the other hand:
- The layer shouldn't matter once DIS is JIT-compiled.
- readInt() might be faster than 4 times read() + manually assembling the int value (if not, DataInputStream needs reengineering).
- readFully() might be better optimized than your hand-coded read loop (if not, let's do it ;-) )
  -- A hand-coded loop might only make sense when using Thread.sleep() after each chunk, so concurrent threads could continue their work while waiting for the hard disk to read.
- Your code will surely run in interpreter mode, as the JIT wouldn't have time to compile it fast enough.
- There is some chance that DIS will already be JIT-compiled from its usage in other program parts before.
- And last but not least, use the given APIs as much as you can, for bytecode footprint reduction. Give a good programming example; newbies tend to use the API sources as a first template for their own code, and seeing API use cases helps them become familiar with the complexity of the Java API. (Same for Arrays.binarySearch())
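
For example, a rough sketch of the DataInputStream variant I have in mind (a sketch only: the leading total-length int and the resource handling are assumptions, not the actual uniName.dat layout):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.InflaterInputStream;

    class DisLoaderSketch {
        static byte[] load() throws IOException {
            InputStream raw =
                DisLoaderSketch.class.getResourceAsStream("uniName.dat");
            DataInputStream dis =
                new DataInputStream(new InflaterInputStream(raw));
            try {
                int total = dis.readInt();      // replaces 4 read()s + shifts
                byte[] data = new byte[total];
                dis.readFully(data);            // replaces a hand-coded loop
                return data;
            } finally {
                dis.close();                    // closes the whole chain
            }
        }
    }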



-- You could create a very long String with the whole data and then use substring() for the individual strings, which could share the same backing char[].

The disadvantage of using one big buffer String to hold everything and then substring-ing the individual names out of it is that it might simply break the SoftReference logic here. The big char[] will never be gc-ed as long as one single name object (substring-ed from it) is still walking around in the system somewhere.
I don't think the VM/GC is that smart, is it?

Good point, I missed that.
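
For illustration, a minimal sketch of the pinning effect (it relies on substring() sharing the backing char[] in the current JDKs; the sizes are arbitrary):

    import java.lang.ref.SoftReference;

    class SubstringPinSketch {
        public static void main(String[] args) {
            // Stand-in for the big buffer String holding all names.
            String all = new String(new char[1 << 20]);    // ~2 MB char[]
            SoftReference<String> ref = new SoftReference<String>(all);

            // substring() shares all's backing char[].
            String oneName = all.substring(0, 10);

            all = null;
            // Even after the SR has been cleared, oneName still pins the
            // whole ~2 MB char[] in the heap.
            System.out.println(oneName.length() + " " + (ref.get() != null));
        }
    }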
But I'm still no friend of the SR usage here. It doesn't solve my main complaints:
- It uneconomically initializes the whole amount of data for likely only one or a few invocations of getName(int cp), and does so repeatedly whenever the SR has been cleared.
- It pollutes the GC more than necessary (the GC would have to handle each of the Strings + char[]s separately), especially when memory approaches its limit. Additionally, if not interned, equal character name strings would be held in memory in as many copies as the SR fails; if interned, they would never be GC'd.
You may argue that this code is rarely used, but if all corners of the Java API were coded in such a memory/performance-wasting way, we ... I don't think any better of it. We could add (Attention: CCC change) a cacheCharacterNames(boolean yesNo) method, as sketched below, to serve users which excessively need this functionality.
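
A rough sketch of that switch (the method name and shape are only my proposal, not existing API; getNames() is the loader discussed below):

    // Hypothetical opt-in cache; adding it would need a CCC request.
    private static HashMap<Integer, String> strongNames;   // null = off

    public static synchronized void cacheCharacterNames(boolean yesNo) {
        strongNames = yesNo ? getNames() : null;  // hold strongly while on
    }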



But this will definitely be faster, given the burden of creating a String from bytes (we put in the optimization
earlier, so this operation should be faster now compared to 6u).

+ saving the memory overhead + GC work for the cpNum char[]s.


Additionally:
- No need to check iis != null in the finally block; a possible NPE would have been thrown earlier.
- Move the SR logic to the get() method to avoid the possibly remaining SR->NPE problem:
    public static String get(int cp) {
        HashMap<Integer, String> names;
        // (Re-)load the map if the SR was never set or has been cleared.
        if (refNames == null || (names = refNames.get()) == null)
            refNames = new SoftReference<>(names = getNames());
        return names.get(cp);
    }
- then synchronize the entire getNames() method.
- save the 2nd null check after synchronizing, as a failure would still be much more unlikely than getName(int cp) usage at all, and would only risk a 2nd superfluous init.
- Is it a good idea to return null to the calling code in case of an I/O failure, instead of propagating the given exception or, better, throwing an Error? (See the sketch after this list.)
- use Integer.toHexString(cp) instead of Integer.toString(cp, 16);
- IMPORTANT (check if the CCC is affected):
  Do I understand right that j.l.Ch.getName('5') would return:
      "Basic Latin 35"
  ... but j.l.Ch.getName('0') would return:
      "DIGIT ZERO..DIGIT NINE"
  I think both should return:
      "DIGIT ZERO..DIGIT NINE" (otherwise we don't have to cache that value ;-) )
  or at least:
      "Basic Latin U+0035"

See the new version in the attachment.


-Ulf

