On 8/27/2012 1:59 PM, Thilo Goetz wrote:
> Once you're done, I'd be interested to know if this has any measurable effect
> beyond the noise level.

Here are my back-of-the-envelope calculations on this:

Each String object has a ref to a char array (4-8 bytes), an offset (4), and a length (4). Each char array object has a length (4) plus the space for the chars. In addition, there's "object" overhead of 8-16 bytes per object (depending on 32- or 64-bit Java, etc.).

In the case of a string using an individual char array, the space is approximately (using 8-byte (64-bit) address refs and 16-byte object overheads):

  2 obj overheads + string obj + char array obj + char length * 2
  = 32 + 16 + 4 + char length * 2
  = 52 + char length * 2

With this patch, each string needs only one object overhead plus the string object itself:

  16 + 16 + char length * 2 = 32 + char length * 2

So the percentage savings depend on the "average" size of the character strings, but the absolute savings might amount to 20 bytes / string.

It's somewhat hard to say what typical CASes have as strings versus other space, and what the average string length might be. For one set I looked at, strings made up about 1/2 the space, and the average string length was about 50 chars. With this, the total string space might have started out as n * (52 + 50 * 2) = n * 152. After this patch, the space would be n * (32 + 50 * 2) = n * 132.

If the string space accounted for 50% of the CAS, the savings in CAS space would be 20/152 divided by 2, or about 6.6%. If the strings accounted for more than 50% of the CAS space, or the average string length was less than 50 chars (100 bytes), the % of savings (in CAS size) would be larger.

Of course, there are other things that take up space in a UIMA application besides the CAS; counting all of that will reduce the overall % effect when measured as a percent of the total space used by the entire application.

So, I would tend to agree that the savings are probably not all that large in most "typical" cases.

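The arithmetic above can be reproduced with a small sketch. The class and method names are hypothetical, and the constants are the rough figures assumed in this message (16-byte object overhead, 8-byte refs), not measured JVM values.

```java
// Back-of-the-envelope estimate of per-string heap cost.
// All constants are the rough assumptions from the message, not measurements.
final class StringSpaceEstimate {
    static final int OBJECT_OVERHEAD = 16;      // assumed per-object overhead, 64-bit JVM
    static final int STRING_FIELDS = 8 + 4 + 4; // char[] ref + offset + length
    static final int ARRAY_LENGTH_FIELD = 4;    // length field of the char[]
    static final int BYTES_PER_CHAR = 2;

    // Each string owns its own char array: two objects per string.
    static long bytesWithOwnArray(int chars) {
        return 2 * OBJECT_OVERHEAD + STRING_FIELDS + ARRAY_LENGTH_FIELD
                + (long) chars * BYTES_PER_CHAR;
    }

    // Strings share one large char array: only the String object is per-string.
    static long bytesWithSharedArray(int chars) {
        return OBJECT_OVERHEAD + STRING_FIELDS + (long) chars * BYTES_PER_CHAR;
    }

    // Fraction of total CAS space saved, given the fraction occupied by strings.
    static double casSavings(int avgChars, double stringFraction) {
        long before = bytesWithOwnArray(avgChars);
        long after = bytesWithSharedArray(avgChars);
        return stringFraction * (before - after) / before;
    }

    public static void main(String[] args) {
        System.out.println(bytesWithOwnArray(50));    // 152
        System.out.println(bytesWithSharedArray(50)); // 132
        System.out.println(casSavings(50, 0.5));      // fraction of CAS space saved
    }
}
```

Calling casSavings with a shorter average length or a larger string fraction shows how the relative savings grow, which is the sensitivity described above.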
-Marshall

>
> On 27.08.2012 16:29, Marshall Schor (JIRA) wrote:
>> Marshall Schor created UIMA-2460:
>> ------------------------------------
>>
>> Summary: Binary deserialization inefficient
>> Key: UIMA-2460
>> URL: https://issues.apache.org/jira/browse/UIMA-2460
>> Project: UIMA
>> Issue Type: Improvement
>> Components: Core Java Framework
>> Reporter: Marshall Schor
>> Assignee: Marshall Schor
>> Priority: Minor
>> Fix For: 2.4.1SDK
>>
>>
>> The CAS binary deserialization code can be made (much) more space efficient.
>> Currently, the char data that is used in the strings is read into a char
>> array; each string is represented as an offset into this char array + a
>> length; and new Java strings are created using new String(chararray, offset,
>> length). This works, but it allocates a new char array for each string being
>> created, and copies from the original char array. This results in new char
>> array objects for each string object.
>>
>> The alternative is to reuse the original char array object, and not allocate
>> any other char array objects. This can be done by:
>> * making a temporary string from the entire char array object, and then
>> * making the new strings using tempString.substring(offset, offset + length)
>>
>> For 1000 strings, this will save 999 char array object overheads (probably
>> about 16 bytes per).
>>
>> An additional space savings is possible by reusing the same string object for
>> equal strings.
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA administrators
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>
>
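
The two approaches described in the quoted issue, plus the reuse of equal strings, can be sketched roughly as follows. This is an illustration only, not the actual UIMA deserialization code: the class name, method names, and the offsets/lengths arrays are hypothetical. Note also that the space saving relies on String.substring sharing the parent's char array, which holds on the JDKs current at the time of this thread; from Java 7 update 6 onward, substring copies the characters, so on newer JDKs both methods allocate one array per string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the string-creation strategies discussed in UIMA-2460.
final class StringDeserializeSketch {

    // Current approach per the issue: new String(char[], offset, length)
    // copies the chars, so every string gets its own char array.
    static String[] copyEach(char[] data, int[] offsets, int[] lengths) {
        String[] out = new String[offsets.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = new String(data, offsets[i], lengths[i]);
        }
        return out;
    }

    // Proposed approach: one temporary String over all the data, then substring().
    // Whether the substrings share the backing array depends on the JDK version.
    static String[] viaSubstring(char[] data, int[] offsets, int[] lengths) {
        String all = new String(data);
        String[] out = new String[offsets.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = all.substring(offsets[i], offsets[i] + lengths[i]);
        }
        return out;
    }

    // Additional saving mentioned in the issue: keep one String object per
    // distinct value and reuse it for every equal string.
    static String[] dedupe(String[] in) {
        Map<String, String> seen = new HashMap<String, String>();
        String[] out = new String[in.length];
        for (int i = 0; i < in.length; i++) {
            String prior = seen.get(in[i]);
            if (prior == null) {
                seen.put(in[i], in[i]);
                prior = in[i];
            }
            out[i] = prior;
        }
        return out;
    }

    public static void main(String[] args) {
        char[] data = "tokensentencetoken".toCharArray();
        int[] offsets = {0, 5, 13};
        int[] lengths = {5, 8, 5};
        String[] strings = dedupe(viaSubstring(data, offsets, lengths));
        System.out.println(strings[1]);               // sentence
        System.out.println(strings[0] == strings[2]); // true: same object reused
    }
}
```

Both creation methods yield equal string values; they differ only in how much heap the results occupy, which is why the effect has to be measured on the target JVM rather than inferred from the code.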
