Re: More memory-efficient internal representation for Strings: call for more data

2014-12-03 Thread charlie hunt
Potentially in the future. It has been on a list of candidate enhancements for 
quite some time.

As Aleksey just mentioned in his response (he beat me to the punch), that work
is not in scope for this project.

Should also mention that the work from this project would not prohibit such an 
enhancement.

hths,

charlie

 On Dec 2, 2014, at 4:13 PM, Vitaly Davidovich vita...@gmail.com wrote:
 
 Any consideration towards removing the char[] (or byte[]) indirection 
 altogether? .NET for example stores the bytes inline with the instance.
 
 Sent from my phone
 
 On Dec 2, 2014 4:59 PM, Aleksey Shipilev aleksey.shipi...@oracle.com wrote:
 Hi,
 
 As you may already know, we are looking into more memory efficient
 representation for Strings:
  https://bugs.openjdk.java.net/browse/JDK-8054307 
 
 As part of preliminary performance work for this JEP, we have to collect
 the empirical data on usual characteristics of Strings and char[]-s
 normal applications have, as well as figure out the early estimates for
 the improvements based on that data. What we have so far is written up here:
  http://cr.openjdk.java.net/~shade/density/string-density-report.pdf 
 
 We would appreciate if people who are interested in this JEP can provide
 the additional data on their applications. It is double-interesting to
 have the data for the applications that process String data outside
 Latin1 plane. Our current data says these cases are rather rare. Please
 read the current report draft, and try to process your own heap dumps
 using the instructions in the Appendix.
 
 Thanks,
 -Aleksey.
 



More memory-efficient internal representation for Strings: call for more data

2014-12-02 Thread Aleksey Shipilev
Hi,

As you may already know, we are looking into a more memory-efficient
representation for Strings:
 https://bugs.openjdk.java.net/browse/JDK-8054307

As part of the preliminary performance work for this JEP, we have to collect
empirical data on the usual characteristics of the Strings and char[]-s
that normal applications have, as well as work out early estimates of
the improvements based on that data. What we have so far is written up here:
 http://cr.openjdk.java.net/~shade/density/string-density-report.pdf

We would appreciate it if people who are interested in this JEP could provide
additional data on their applications. It is doubly interesting to
have data for applications that process String data outside the
Latin1 plane. Our current data says these cases are rather rare. Please
read the current report draft, and try to process your own heap dumps
using the instructions in the Appendix.

Thanks,
-Aleksey.
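
[Editor's sketch] The question the call for data asks of each String is whether every char fits in one byte, that is, lies in U+0000..U+00FF. The following is only a rough illustration of that check over an in-memory sample. It is not the heap-dump procedure from the report's Appendix, and the class and method names (Latin1Survey, fitsLatin1, compressibleFraction) are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

final class Latin1Survey {
    // True when every UTF-16 code unit of s is in the range U+0000..U+00FF.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Fraction of the sample that a one-byte-per-char form could hold.
    static double compressibleFraction(List<String> sample) {
        if (sample.isEmpty()) {
            return 0.0;
        }
        long hits = sample.stream().filter(Latin1Survey::fitsLatin1).count();
        return (double) hits / sample.size();
    }

    public static void main(String[] args) {
        // Hypothetical sample: ASCII, Latin1 with an accent, and CJK text.
        List<String> sample = Arrays.asList("SELECT 1", "caf\u00e9", "\u65e5\u672c\u8a9e");
        System.out.println(compressibleFraction(sample));
    }
}
```

A real survey would weight the result by String length and count, as a heap dump does, rather than treating every String equally.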



Re: More memory-efficient internal representation for Strings: call for more data

2014-12-02 Thread Vitaly Davidovich
Any consideration towards removing the char[] (or byte[]) indirection
altogether? .NET for example stores the bytes inline with the instance.

Sent from my phone
On Dec 2, 2014 4:59 PM, Aleksey Shipilev aleksey.shipi...@oracle.com
wrote:

 Hi,

 As you may already know, we are looking into more memory efficient
 representation for Strings:
  https://bugs.openjdk.java.net/browse/JDK-8054307

 As part of preliminary performance work for this JEP, we have to collect
 the empirical data on usual characteristics of Strings and char[]-s
 normal applications have, as well as figure out the early estimates for
 the improvements based on that data. What we have so far is written up
 here:
  http://cr.openjdk.java.net/~shade/density/string-density-report.pdf

 We would appreciate if people who are interested in this JEP can provide
 the additional data on their applications. It is double-interesting to
 have the data for the applications that process String data outside
 Latin1 plane. Our current data says these cases are rather rare. Please
 read the current report draft, and try to process your own heap dumps
 using the instructions in the Appendix.

 Thanks,
 -Aleksey.




Re: More memory-efficient internal representation for Strings: call for more data

2014-12-02 Thread Aleksey Shipilev
Hi Vitaly,

Please read the JEP proposal. String/char[] fusion (that's what you are
describing) is out of scope for this work. Baby steps. Careful baby steps.

-Aleksey.

On 03.12.2014 01:13, Vitaly Davidovich wrote:
 Any consideration towards removing the char[] (or byte[]) indirection
 altogether? .NET for example stores the bytes inline with the instance.
 
 Sent from my phone
 
 On Dec 2, 2014 4:59 PM, Aleksey Shipilev aleksey.shipi...@oracle.com wrote:
 
 Hi,
 
 As you may already know, we are looking into more memory efficient
 representation for Strings:
  https://bugs.openjdk.java.net/browse/JDK-8054307
 
 As part of preliminary performance work for this JEP, we have to collect
 the empirical data on usual characteristics of Strings and char[]-s
 normal applications have, as well as figure out the early estimates for
 the improvements based on that data. What we have so far is written
 up here:
  http://cr.openjdk.java.net/~shade/density/string-density-report.pdf
 
 We would appreciate if people who are interested in this JEP can provide
 the additional data on their applications. It is double-interesting to
 have the data for the applications that process String data outside
 Latin1 plane. Our current data says these cases are rather rare. Please
 read the current report draft, and try to process your own heap dumps
 using the instructions in the Appendix.
 
 Thanks,
 -Aleksey.
 




Re: More memory-efficient internal representation for Strings: call for more data

2014-12-02 Thread Douglas Surber
String construction is a big performance issue for JDBC drivers. Most
queries return some number of Strings. The overwhelming majority of
those Strings will be short-lived. The cost of constructing these
Strings from network bytes is a large fraction of total execution
time. Any increase in the cost of constructing a String will far
outweigh any reduction in memory use, at least for query results.


All of the proposed compression methods require an additional scan of 
the entire string. That's exactly the wrong direction. Something like 
the following pseudo-code is common inside a driver.


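  {
    char[] c = new char[n];
    // Decode the value one char at a time from the network source.
    for (int i = 0; i < n; i++) c[i] = charSource.next();
    // The String constructor copies c once more.
    return new String(c);
  }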

The array copy inside the String constructor is a significant
fraction of JDBC driver execution time. Adding an additional scan on
top of it makes things worse regardless of the transient benefit
of more compact storage. In the case of a query result the String
will likely never be promoted out of new space; the benefit of
compression would be minimal.
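
[Editor's sketch] To make the "additional scan" concrete, here is one possible shape for it, assuming a Latin1-style scheme in which a String is stored as one byte per char when every char is at most U+00FF. The helper tryCompress is hypothetical. It is neither a JDK method nor the JEP's implementation. It checks and narrows in the same loop, so each char is still touched once during construction. Whether that costs more than the plain copy is the open measurement question in this thread.

```java
final class Latin1Compress {
    // Hypothetical helper: narrows src to one byte per char, or returns null
    // as soon as a char above U+00FF is seen.
    static byte[] tryCompress(char[] src) {
        byte[] out = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            char c = src[i];
            if (c > 0xFF) {
                return null; // not representable; caller keeps two bytes per char
            }
            out[i] = (byte) c;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] compact = tryCompress("SELECT".toCharArray());
        System.out.println(compact != null ? compact.length : -1);
        System.out.println(tryCompress("\u65e5\u672c".toCharArray()) == null);
    }
}
```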


I don't dispute that Strings occupy a significant fraction of the 
heap or that a lot of those bytes are zero. And I certainly agree 
that reducing memory footprint is valuable, but any worsening of 
String construction time will likely be a problem.


Douglas

At 02:13 PM 12/2/2014, core-libs-dev-requ...@openjdk.java.net wrote:

Date: Wed, 03 Dec 2014 00:59:10 +0300
From: Aleksey Shipilev aleksey.shipi...@oracle.com
To: Java Core Libs core-libs-dev@openjdk.java.net
Cc: charlie hunt charlie.h...@oracle.com
Subject: More memory-efficient internal representation for Strings:
 call for more data
Message-ID: 547e362e.5010...@oracle.com
Content-Type: text/plain; charset=utf-8

Hi,

As you may already know, we are looking into more memory efficient
representation for Strings:
 https://bugs.openjdk.java.net/browse/JDK-8054307

As part of preliminary performance work for this JEP, we have to collect
the empirical data on usual characteristics of Strings and char[]-s
normal applications have, as well as figure out the early estimates for
the improvements based on that data. What we have so far is written up here:
 http://cr.openjdk.java.net/~shade/density/string-density-report.pdf

We would appreciate if people who are interested in this JEP can provide
the additional data on their applications. It is double-interesting to
have the data for the applications that process String data outside
Latin1 plane. Our current data says these cases are rather rare. Please
read the current report draft, and try to process your own heap dumps
using the instructions in the Appendix.

Thanks,
-Aleksey.




Re: More memory-efficient internal representation for Strings: call for more data

2014-12-02 Thread Aleksey Shipilev
Hi Douglas,

On 12/03/2014 02:24 AM, Douglas Surber wrote:
 String construction is a big performance issue for JDBC drivers. Most
 queries return some number of Strings. The overwhelming majority of
 those Strings will be short lived. The cost of constructing these
 Strings from network bytes is a large fraction of total execution time.
 Any increase in the cost of constructing a String will far out weigh any
 reduction in memory use, at least for query results.

You will also have to take into account that shorter (compressed)
Strings allow for more efficient operations on them. Not to mention
that GC costs are usually hidden from naive performance estimates:
even though you may see the mutator spending more time doing work,
that might be offset by an easier job for the GC.

 All of the proposed compression methods require an additional scan of
 the entire string. That's exactly the wrong direction. Something like
 the following pseudo-code is common inside a driver.
 
   {
     char[] c = new char[n];
     for (i = 0; i < n; i++) c[i] = charSource.next();
     return new String(c);
   }

Good to know. We will be assessing the String(char[]) construction
performance in the course of this performance work. What would you say
is a characteristic high-level benchmark for the scenario you are
describing?

 The array copy inside the String constructor is a significant fraction
 of JDBC driver execution time. Adding an additional scan on top of it is
 making things worse regardless of the transient benefit of more compact
 storage. In the case of a query result the String will be likely never
 be promoted out of new space; the benefit of compression would be minimal.

It's hard to say at this point. We want to understand what footprint
improvements we are talking about. I agree that if the cost-benefit
analysis says performance degrades beyond sane limits, even if we
are happy with the memory savings, there is little reason to push this
into the general JDK.

Thanks,
-Aleksey



Re: More memory-efficient internal representation for Strings: call for more data

2014-12-02 Thread Douglas Surber
The most common operation on most Strings in query results is to do 
nothing. Just construct the String, hold onto it while the rest of 
the transaction completes, then drop it on the floor. Probably the 
next most common is to encode the chars to write them to an 
OutputStream or send them back to the database. I'd be curious how a 
compact representation would help those operations.


SPECjEnterprise is a widely used standard benchmark. It probably uses 
mostly (or even entirely) ASCII characters so it's not representative 
of many customers.


My definition of sane limits might be different than yours. As far 
as I'm concerned String construction is already too slow and should 
be made faster by eliminating the char[] copy when possible.
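
[Editor's sketch] The copy in question exists because String is immutable while the caller keeps a reference to the array it passed in. A small demonstration, using the hypothetical class name DefensiveCopy:

```java
final class DefensiveCopy {
    static String build(char[] c) {
        return new String(c); // copies c into the String's own storage
    }

    public static void main(String[] args) {
        char[] c = {'a', 'b', 'c'};
        String s = build(c);
        c[0] = 'X';            // mutate the caller's array afterwards
        System.out.println(s); // prints "abc": the String did not see the write
    }
}
```

Eliminating the copy would therefore need some guarantee that the array is never written to after it is handed over.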


Douglas

At 03:47 PM 12/2/2014, Aleksey Shipilev wrote:

Hi Douglas,

On 12/03/2014 02:24 AM, Douglas Surber wrote:
 String construction is a big performance issue for JDBC drivers. Most
 queries return some number of Strings. The overwhelming majority of
 those Strings will be short lived. The cost of constructing these
 Strings from network bytes is a large fraction of total execution time.
 Any increase in the cost of constructing a String will far out weigh any
 reduction in memory use, at least for query results.

You will also have to take into the account that shorter (compressed)
Strings allow for more efficient operations on them. This is not to
mention the GC costs are also usually hidden from the naive
performance estimations: even though you can perceive the mutator is
spending more time doing work, that might be offset by easier job for GC.

 All of the proposed compression methods require an additional scan of
 the entire string. That's exactly the wrong direction. Something like
 the following pseudo-code is common inside a driver.

   {
     char[] c = new char[n];
     for (i = 0; i < n; i++) c[i] = charSource.next();
     return new String(c);
   }

Good to know. We will be assessing the String(char[]) construction
performance in the course of this performance work. What would you say
is a characteristic high-level benchmark for the scenario you are
describing?

 The array copy inside the String constructor is a significant fraction
 of JDBC driver execution time. Adding an additional scan on top of it is
 making things worse regardless of the transient benefit of more compact
 storage. In the case of a query result the String will be likely never
 be promoted out of new space; the benefit of compression would be minimal.

It's hard to say at this point. We want to understand what footprint
improvements we are talking about. I agree that if cost-benefit analysis
will say the performance is degrading beyond the sane limits even if we
are happy with memory savings, there is little reason to push this into
the general JDK.

Thanks,
-Aleksey





Re: More memory-efficient internal representation for Strings: call for more data

2014-12-02 Thread Xueming Shen

On 12/02/2014 04:42 PM, Douglas Surber wrote:

The most common operation on most Strings in query results is to do nothing. 
Just construct the String, hold onto it while the rest of the transaction 
completes, then drop it on the floor. Probably the next most common is to 
encode the chars to write them to an OutputStream or send them back to the 
database. I'd be curious how a compact representation would help those 
operations.



It depends on what is inside those query results. If most of them are ASCII
and only a small portion is double-byte user data (which is true, for example,
of most UTF-8 XML files), you might save CPU time and improve throughput by
copying only half as many bytes around during their life cycle, especially if
copying is the only operation carried out on them.

-Sherman
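
[Editor's sketch] The "half the bytes" point can be checked with simple payload arithmetic. The example below compares the encoded size of an ASCII value at two bytes per char (UTF-16) and at one byte per char (ISO-8859-1). It models the character payload only, with no object headers or array overhead, and the class name CopyCost is made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class CopyCost {
    // Payload size at two bytes per char (UTF-16, big-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Payload size at one byte per char; only meaningful for Latin1 content.
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String value = "SELECT";
        System.out.println(utf16Bytes(value));  // prints 12
        System.out.println(latin1Bytes(value)); // prints 6
    }
}
```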



SPECjEnterprise is a widely used standard benchmark. It probably uses mostly 
(or even entirely) ASCII characters so it's not representative of many 
customers.

My definition of sane limits might be different than yours. As far as I'm 
concerned String construction is already too slow and should be made faster by 
eliminating the char[] copy when possible.

Douglas

At 03:47 PM 12/2/2014, Aleksey Shipilev wrote:

Hi Douglas,

On 12/03/2014 02:24 AM, Douglas Surber wrote:
 String construction is a big performance issue for JDBC drivers. Most
 queries return some number of Strings. The overwhelming majority of
 those Strings will be short lived. The cost of constructing these
 Strings from network bytes is a large fraction of total execution time.
 Any increase in the cost of constructing a String will far out weigh any
 reduction in memory use, at least for query results.

You will also have to take into the account that shorter (compressed)
Strings allow for more efficient operations on them. This is not to
mention the GC costs are also usually hidden from the naive
performance estimations: even though you can perceive the mutator is
spending more time doing work, that might be offset by easier job for GC.

 All of the proposed compression methods require an additional scan of
 the entire string. That's exactly the wrong direction. Something like
 the following pseudo-code is common inside a driver.

   {
     char[] c = new char[n];
     for (i = 0; i < n; i++) c[i] = charSource.next();
     return new String(c);
   }

Good to know. We will be assessing the String(char[]) construction
performance in the course of this performance work. What would you say
is a characteristic high-level benchmark for the scenario you are
describing?

 The array copy inside the String constructor is a significant fraction
 of JDBC driver execution time. Adding an additional scan on top of it is
 making things worse regardless of the transient benefit of more compact
 storage. In the case of a query result the String will be likely never
 be promoted out of new space; the benefit of compression would be minimal.

It's hard to say at this point. We want to understand what footprint
improvements we are talking about. I agree that if cost-benefit analysis
will say the performance is degrading beyond the sane limits even if we
are happy with memory savings, there is little reason to push this into
the general JDK.

Thanks,
-Aleksey