On Thu, 24 Mar 2022 17:43:31 GMT, XenoAmess <d...@openjdk.java.net> wrote:

>> 8186958: Need method to create pre-sized HashMap
>
> XenoAmess has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - update jmh
>  - refine javadoc; refine implement when expectedSize < 0

OK, finally got some time to look at this. Here's a rewrite of the spec words, 
at least for HashMap::newHashMap. If this settles down, I'll write the CSR for 
this and LHM and WHM.

    /**
     * Creates a new, empty HashMap suitable for the expected number of 
mappings.
     * The returned map uses the default load factor of 0.75, and its initial 
capacity is
     * generally large enough so that the expected number of mappings can be 
added
     * without resizing the map.
     *
     * @param numMappings the expected number of mappings
     * @param <K>         the type of keys maintained by this map
     * @param <V>         the type of mapped values
     * @return the newly created map
     * @throws IllegalArgumentException if numMappings is negative
     * @since 19
     */

The original wording was taken from CHM, which generally is a reasonable thing 
to do. Unfortunately, it's wrong. :-) "Table size" isn't accurate; HashMap uses 
"capacity" as the term for the number of buckets (== length of the internal 
table array). "Size" means "number of mappings" so its use of "table size" 
confuses these two concepts. The CHM wording also uses "elements" which applies 
to linear collections; the things inside a map are usually referred to as 
"mappings" or "entries". (I guess we should fix up CHM at some point too.)

While "expectedSize" isn't inaccurate, it's not tied to the main text, so I've 
renamed it to numMappings.

There are a couple other javadoc style rules operating here. The first sentence 
is generally a sentence fragment that is short and descriptive, as it will be 
pulled into a summary table. (It's often written as if it were a sentence that 
begins "This method..." but those words are elided.) Details are in the rest of 
the first paragraph. The text for `@param`, `@return`, and `@throws` are 
generally also sentence fragments, and they have no initial capital letter nor 
trailing period.

--

On performance and benchmarking: this is a distraction from the main point of 
this effort, which is to add an API that allows callers a correct and 
convenient way to create a properly-sized HashMap. Any work spent on optimizing 
performance is almost certainly wasted.

First, the benchmark: it's entirely unclear what this is measuring. It's 
performing the operation 2^31 times, but it sends the result to a black hole so 
that the JIT doesn't eliminate the computation. One of the actual results is 
0.170 ops/sec. This includes both the operation and the black hole, so we don't 
actually have any idea what that result represents. _Maybe_ it's possible to 
infer some idea of relative performance of the different operations by 
comparing the results. It's certainly plausible that the integer code is faster 
than the float or double code. But the benchmark doesn't tell us how long the 
actual computation takes.

Second, how significant is the cost of the computation? I'll assert that it's 
insignificant. The table length is computed once at HashMap creation time, and 
it's amortized over the addition of all the initial entries and subsequent 
queries and updates to the HashMap. Any of the computations (whether integer or 
floating point) are a handful of nanoseconds. This will be swamped by the first 
hashCode computation that causes a cache miss.

Third: I'll stipulate that the integer computation is probably a few ns faster 
than the floating-point computation. But the computation is entirely 
non-obvious. To make up for it being non-obvious, there's a big comment that 
explains it all. It's not worth doing something that increases performance by 
an insignificant amount that also requires a big explanation.

Finally, note that most of the callers are already doing a floating-point 
computation to compute the desired capacity, and it doesn't seem to be a 
problem.

Sorry, you probably spent a bunch of time on this already, but trying to 
optimize the performance here just isn't worthwhile. Let's please just stick 
with our good old `(int) Math.ceil(numMappings / 0.75)`.

--

There should be regression tests added for the three new methods. I haven't 
thought much about this. It might be possible to reuse some of the 
infrastructure in the WhiteBoxResizeTest we worked on previously.

--

I think it would be good to include updates to some of the use sites in this 
PR. It's often useful to put new APIs into practice. One usually learns 
something from the effort. Even though this is a really simple API, looking at 
use sites can illuminating, to see how the code reads. This might affect the 
method name, for example.

You don't need to update all of the use sites in the JDK, but it would be good 
to choose a reasonable sample. Maybe the ones from a single package, or a 
handful (like java.lang or java.util). Maybe include 
Class::enumConstantDirectory. If this use site is updated, then maybe it will 
allow us to get rid of that ConstantDirectoryOptimalCapacity test that we 
problem-listed in the previous PR.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7928

Reply via email to