UnsafeByteArrayOutputStream.java

Uwe Schindler Thu, 30 Jun 2011 23:51:18 -0700

Hi,

I don't understand the whole discussion here, so please compare these two 
implementations and tell me which one is faster. Please don't hurt me, if you 
don't want to see src.jar code from OpenJDK Java6 - just delete this mail if 
you don’t want to (the code here is licensed under GPL):

Arrays.java:
    /**
     * Copies the specified array, truncating or padding with zeros (if 
necessary)
     * so the copy has the specified length.  For all indices that are
     * valid in both the original array and the copy, the two arrays will
     * contain identical values.  For any indices that are valid in the
     * copy but not the original, the copy will contain <tt>(byte)0</tt>.
     * Such indices will exist if and only if the specified length
     * is greater than that of the original array.
     *
     * @param original the array to be copied
     * @param newLength the length of the copy to be returned
     * @return a copy of the original array, truncated or padded with zeros
     *     to obtain the specified length
     * @throws NegativeArraySizeException if <tt>newLength</tt> is negative
     * @throws NullPointerException if <tt>original</tt> is null
     * @since 1.6
     */
    public static byte[] copyOf(byte[] original, int newLength) {
        byte[] copy = new byte[newLength];
        System.arraycopy(original, 0, copy, 0,
                         Math.min(original.length, newLength));
        return copy;
    }

This is our implementation, simon replaced and Robert reverted 
(UnsafeByteArrayOutputStream):

  private void grow(int newLength) {
    // It actually should be: (Java 1.7, when its intrinsic on all machines)
    // buffer = Arrays.copyOf(buffer, newLength);
    byte[] newBuffer = new byte[newLength];
    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
    buffer = newBuffer;
  }

So please look at the code, where is a difference that could slow down, except 
the Math.min() call which is an intrinsic in almost every JDK on earth?

The problem we are talking about here is only about the generic Object[] copyOf 
method and also affects e.g. *all* Collection.toArray() methods - they all use 
this code, so whenever you use ArrayList.toArray() or similar, the slow code is 
executed. This is why we replaced Collections.sort() by CollectionUtil.sort, 
that does no array copy. Simon & me were not willing to replace the 
reallocations in FST code (Mike you remember, we reverted that on your GIT repo 
when we did perf tests) and other parts in Lucene (there are only few of them). 
The idea was only to replace primitive type code to make it easier readable. 
And with later JDK code it could even get faster (not slower), if Oracle starts 
to add intrinsics for those new methods (and that’s Dawid and mine reason to 
change to copyTo for primitive types). In general, if you use Java SDK methods, 
that are as fast as ours, they always have a chance to get faster in later 
JDKs. So we should always prefer Java SDK methods, unless they are slower 
because their default impl is too generic or has too much safety checks or uses 
reflection.

To come back to UnsafeByteArrayOutputStream:

I would change the whole code, as I don’t like the allocation strategy in it 
(it's exponential, on every grow it doubles its size). We should change that to 
use ArrayUtils.grow() and ArrayUtils.oversize(), to have a similar allocation 
strategy like in trunk. Then we can discuss about this problem again when Simon 
& me wants to change ArrayUtils.grow methods to use Arrays.copyOf... *g* [just 
joking, I will never ask again, because this discussion here is endless and 
does not bring us forward].

The other thing I don’t like in the new faceting module is duplication of vint 
code. Why don’t we change it to use DataInput/DataOutput and use Dawid's new 
In/OutStream wrapper for DataOutput everywhere. This would be much more 
streamlined with all the code we currently have. Then we can encode the 
payloads (or later docvalues) using the new UnsafeByteArrayOutputStream, 
wrapped with a OutputStreamDataOutput wrapper? Or maybe add a 
ByteArrayDataOutput class.

Uwe
(getting crazy)

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Simon Willnauer [mailto:[email protected]]
> Sent: Friday, July 01, 2011 7:47 AM
> To: Michael McCandless
> Cc: [email protected]; Dawid Weiss
> Subject: Re: svn commit: r1141510 -
> /lucene/dev/trunk/modules/facet/src/java/org/apache/lucene/util/Unsafe
> ByteArrayOutputStream.java
> 
> On Fri, Jul 1, 2011 at 12:19 AM, Michael McCandless
> <[email protected]> wrote:
> > On Thu, Jun 30, 2011 at 4:45 PM, Simon Willnauer
> > <[email protected]> wrote:
> >> On Thu, Jun 30, 2011 at 8:50 PM, Dawid Weiss
> >> <[email protected]> wrote:
> >>>> I don't seen any evidence that this is any slower though.
> >>>
> >>> You need to run with -client (if the machine is a beast this is
> >>> tricky because x64 will pick -server regardless of the command-line
> >>> setting) and you need to be copying generic arrays. I think this can
> >>> be shown
> >>> -- a caliper benchmark would be perfect to demonstrate this in
> >>> isolation; I may write one if I find a spare moment.
> >>
> >> this is what I want to see. I don't want to discuss based on some bug
> >> reported for a non-primitive version of copyOf thats all.
> >> its pointless to discuss if there is no evidence which I don't see. I
> >> am happy with arraycopy I would just have appreciated a discussion
> >> before backing the change out.
> >
> > I think the burden of proof here is on Arrays.copyOf.
> >
> > Ie, until we can prove (through benchmarking in different envs) that
> > it can be trusted, we should just stick with System.arraycopy to
> > reduce the risk.
> 
> Mike I think we should do that but the real issue here is what if somebody
> comes up with any arbitrary method in the future claiming its slow we back
> out and use the "we think safe way" what if it is actually the other way
> around and copyOf is optimized by new VMs and the copyarray is slightly
> slower. If robert would not have reverted this commit we would have not
> discussed that her at all. I just want to prevent the "we should not do this" 
> it
> might be a problem in the big picture while the microbenchmark doesn't
> show a difference. At some point we have to rely on the JVM.
> Even if we benchmark on all OS with various JVMs we can't prevent jvm bug
> based perf hits. While there was no bug reported for primitives here we
> don't have to be afraid of it either. I don't think its saver to use 
> arraycopy at
> all its just a level of indirection safer but makes development more of a pain
> IMO.
> 
> Just my $0.05
> >
> > Mike
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected] For additional
> commands, e-mail: [email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

RE: svn commit: r1141510 - /lucene/dev/trunk/modules/facet/src/java/org/apache/lucene/util/UnsafeByteArrayOutputStream.java

Reply via email to