On 27.04.2010 06:25, Xueming Shen wrote:
Ulf Zibis wrote:
On 24.04.2010 01:09, Xueming Shen wrote:

I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (the last version was 122+k), but then I can no longer use cpLen as the capacity for the hashmap. I'm now using a hardcoded 20000 for 5.2.

Again, is 88k the compressed or the uncompressed size?

Yes, it's the size of compressed data.

I'm wondering, as script.txt is only ~120k.

Your smart "save one more byte" suggestion will save 400+ bytes, a tiny 0.5%, unfortunately :-)

I didn't mean the saving in total file footprint, I meant the byte-wise read() count: my code only needs to read 1 int per character block instead of 1 byte + 1 int, which also looks kind of ugly.
Anyway, the theoretical maximum win would be < 20 %.
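
To make that concrete, a minimal sketch of the two record layouts I am comparing; the packed format (name length in the high byte of the offset int) is my assumption for illustration, not the actual uniName.dat layout:

    import java.io.DataInputStream;
    import java.io.IOException;

    class RecordLayoutSketch {
        // Packed variant: one 4-byte read per character block.
        // Assumed layout: high byte = name length, low 3 bytes = offset.
        static int[] readPacked(DataInputStream dis) throws IOException {
            int packed = dis.readInt();
            return new int[] { packed >>> 24, packed & 0xFFFFFF };
        }

        // Current variant: 1 byte + 1 int = 5 bytes and two reads per block.
        static int[] readPlain(DataInputStream dis) throws IOException {
            return new int[] { dis.readUnsignedByte(), dis.readInt() };
        }
    }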



-- Is it faster to first copy the whole data into a byte[] and then use ByteBuffer.getInt() etc., compared to directly using DataInputStream methods?
The current impl uses neither ByteBuffer nor DataInputStream, so there is no comparison to make here.
If JIT-compiled, bb.get() should be as fast as ba[cpOff++] & 0xff.
My comparison is about the manual byte-to-int assembling + the triple buffering of the data (getResourceAsStream() returns a buffered stream, and I believe InflaterInputStream is buffered too).

Yes, using DataInputStream will definitely make the code look better (no more of those "ugly" shifts), but it will also slow things down a little, since it adds one more layer. But speed may not really be a concern here.

On the other hand:
- The layer shouldn't matter once DIS is JIT-compiled.
- readInt() might be faster than 4 times read() + manually assembling the int value (if not, DataInputStream needs reengineering).
- readFully() might be better optimized than your hand-coded read loop (if not, let's do it ;-) )
  -- A hand-coded loop might only make sense when using Thread.sleep() after each chunk, so concurrent threads could continue their work while waiting for the hard disk to read.
- Your code will surely run in interpreter mode, as the JIT wouldn't have time to compile it fast enough.
- There is some chance that DIS will already be JIT-compiled from its usage in other program parts before.
- And last but not least, use the given APIs as much as you can, for bytecode footprint reduction. Give a good programming example; newbies tend to use the API sources as a first template for their own code, and seeing API use cases helps them become familiar with the complexity of the Java API. (Same for Arrays.binarySearch())
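
For example, a rough sketch of the DataInputStream variant I have in mind (a sketch only: the leading total-length int and the resource handling are assumptions, not the actual uniName.dat layout):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.InflaterInputStream;

    class DisLoaderSketch {
        static byte[] load() throws IOException {
            InputStream raw =
                DisLoaderSketch.class.getResourceAsStream("uniName.dat");
            DataInputStream dis =
                new DataInputStream(new InflaterInputStream(raw));
            try {
                int total = dis.readInt();      // replaces 4 read()s + shifts
                byte[] data = new byte[total];
                dis.readFully(data);            // replaces a hand-coded loop
                return data;
            } finally {
                dis.close();                    // closes the whole chain
            }
        }
    }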



-- You could create a very long String with the whole data and then use substring() for the individual strings, which could share the same backing char[].

The disadvantage of using one big buffer String to hold everything and then substring-ing the individual names out of it is that it might simply break the SoftReference logic here. The big char[] will never be gc-ed as long as one single name object (substring-ed from it) is still walking around in the system somewhere.
I don't think the VM/GC is that smart, is it?

Good point, I missed that.
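
For illustration, a minimal sketch of the pinning effect (it relies on substring() sharing the backing char[] in the current JDKs; the sizes are arbitrary):

    import java.lang.ref.SoftReference;

    class SubstringPinSketch {
        public static void main(String[] args) {
            // Stand-in for the big buffer String holding all names.
            String all = new String(new char[1 << 20]);    // ~2 MB char[]
            SoftReference<String> ref = new SoftReference<String>(all);

            // substring() shares all's backing char[].
            String oneName = all.substring(0, 10);

            all = null;
            // Even after the SR has been cleared, oneName still pins the
            // whole ~2 MB char[] in the heap.
            System.out.println(oneName.length() + " " + (ref.get() != null));
        }
    }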
But I'm still no friend of the SR usage here. It doesn't solve my main complaints:
- It uneconomically initializes the whole amount of data for likely only one or a few invocations of getName(int cp), and does so repeatedly whenever the SR has been cleared.
- It pollutes the GC more than necessary (the GC would have to handle each of the Strings + char[]s separately), especially when memory approaches its limit. Additionally, if not interned, equal character name strings would be held in memory in as many copies as the SR fails; if interned, they would never be GC'd.
You may argue that this code is rarely used, but if all corners of the Java API were coded in such a memory/performance-wasting way, we ... I don't think any better of it. We could add (Attention: CCC change) a cacheCharacterNames(boolean yesNo) method, as sketched below, to serve users which excessively need this functionality.
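
A rough sketch of that switch (the method name and shape are only my proposal, not existing API; getNames() is the loader discussed below):

    // Hypothetical opt-in cache; adding it would need a CCC request.
    private static HashMap<Integer, String> strongNames;   // null = off

    public static synchronized void cacheCharacterNames(boolean yesNo) {
        strongNames = yesNo ? getNames() : null;  // hold strongly while on
    }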



But this will definitely be faster, given the burden of creating a String from bytes (we put in the optimization
earlier, so this operation should be faster now compared to 6u).

+ saving the memory overhead + GC work for the cpNum char[]s.


Additionally:
- No need to check iis != null in the finally block; a possible NPE would have been thrown earlier.
- Move the SR logic to the get() method to avoid the possibly remaining SR->NPE problem:
    public static String get(int cp) {
        HashMap<Integer, String> names;
        // (Re-)load the map if the SR was never set or has been cleared.
        if (refNames == null || (names = refNames.get()) == null)
            refNames = new SoftReference<>(names = getNames());
        return names.get(cp);
    }
- then synchronize the entire getNames() method.
- save the 2nd null check after synchronizing, as a failure would still be much more unlikely than getName(int cp) usage at all, and would only risk a 2nd superfluous init.
- Is it a good idea to return null to the calling code in case of an I/O failure, instead of propagating the given exception or, better, throwing an Error? (See the sketch after this list.)
- use Integer.toHexString(cp) instead of Integer.toString(cp, 16);
- IMPORTANT (check if the CCC is affected):
  Do I understand right that j.l.Ch.getName('5') would return:
      "Basic Latin 35"
  ... but j.l.Ch.getName('0') would return:
      "DIGIT ZERO..DIGIT NINE"
  I think both should return:
      "DIGIT ZERO..DIGIT NINE" (otherwise we don't have to cache that value ;-) )
  or at least:
      "Basic Latin U+0035"

See the new version in the attachment.


-Ulf

