GitHub user juliuszsompolski opened a pull request:

    https://github.com/apache/spark/pull/20555

    [SPARK-23366] Improve hot reading path in ReadAheadInputStream

    ## What changes were proposed in this pull request?
    
    `ReadAheadInputStream` was introduced in 
https://github.com/apache/spark/pull/18317/ to optimize reading spill files 
from disk.
    However, profiles show that the hot path of reading small amounts of data (e.g. `readInt`) is inefficient: it involves taking locks and multiple checks.
    
    Optimize locking: a lock is not needed when simply accessing the active buffer; lock only when swapping buffers, triggering an async read, or querying the async state.
    
    Optimize the short path for single-byte reads, which is used e.g. by the Java library's `DataInputStream.readInt`.
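
    As a minimal sketch of the two optimizations above (illustrative only: `activeBuffer`, `stateChangeLock` and `readSlowPath` are assumed names, not the actual Spark code), the single-byte read can be served straight from the active buffer, with locking confined to the slow path:

    ```java
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.util.concurrent.locks.ReentrantLock;

    abstract class FastPathReadStream extends InputStream {
      // Only the reader thread consumes from the active buffer, so the hot
      // path below can read it without taking stateChangeLock.
      protected ByteBuffer activeBuffer = ByteBuffer.allocate(0);
      protected final ReentrantLock stateChangeLock = new ReentrantLock();

      @Override
      public int read() throws IOException {
        if (activeBuffer.hasRemaining()) {
          // Hot path: no locks and no extra checks; this is what
          // DataInputStream.readInt hits four times per int read.
          return activeBuffer.get() & 0xFF;
        }
        // Slow path: swap buffers / wait for the async read under
        // stateChangeLock.
        return readSlowPath();
      }

      protected abstract int readSlowPath() throws IOException;
    }
    ```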
    
    The asyncReader used to call `read` only once on the underlying stream, which never filled the buffer when the stream was wrapping an LZ4BlockInputStream. Because the buffer came back unfilled, the async reader was re-triggered on every call to fill the read-ahead buffer, since the active buffer was always below the refill threshold. The fix is to keep reading in a loop until the read-ahead buffer is full.
    
    However, always filling the full buffer could increase latency, so also add an `AtomicBoolean` flag that lets the async reader return early when a reader is waiting for data.
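
    A minimal sketch of the fill loop with the early-exit flag (again with assumed names; not the actual Spark implementation):

    ```java
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.util.concurrent.atomic.AtomicBoolean;

    class ReadAheadFiller {
      // Set by the reader thread when it blocks waiting for data.
      private final AtomicBoolean readerWaiting = new AtomicBoolean(false);

      // Runs on the async reader thread. A single in.read() on e.g. an
      // LZ4BlockInputStream returns at most one decompressed block, so loop
      // until the read-ahead buffer is full or EOF.
      int fill(InputStream in, ByteBuffer buf) throws IOException {
        int total = 0;
        while (buf.hasRemaining()) {
          int n = in.read(buf.array(), buf.arrayOffset() + buf.position(),
              buf.remaining());
          if (n < 0) {
            break; // EOF
          }
          buf.position(buf.position() + n);
          total += n;
          if (readerWaiting.get()) {
            // A reader is blocked on us: hand over what we have so far
            // instead of waiting for a completely full buffer.
            break;
          }
        }
        return total;
      }

      // Called by the reader thread around its wait for the async read.
      void setReaderWaiting(boolean waiting) {
        readerWaiting.set(waiting);
      }
    }
    ```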
    
    Remove `readAheadThresholdInBytes` and instead trigger the async read immediately when swapping buffers. This simplifies the code paths, especially the hot one, which then only has to check whether data is available in the active buffer, without worrying about whether it needs to retrigger the async read. It seems to have a positive effect on performance.
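
    A minimal sketch of the resulting slow path (`waitForAsyncReadToComplete` and `triggerReadAhead` are assumed helpers, not the actual Spark code):

    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.concurrent.locks.ReentrantLock;

    abstract class BufferSwapper {
      protected final ReentrantLock stateChangeLock = new ReentrantLock();
      protected ByteBuffer activeBuffer = ByteBuffer.allocate(0);
      protected ByteBuffer readAheadBuffer = ByteBuffer.allocate(64 * 1024);

      // Slow path only: swap buffers under the lock and immediately kick off
      // the next read-ahead. With no refill threshold, the hot path never
      // has to decide whether to retrigger the async read.
      protected void swapBuffersAndTriggerReadAhead() throws IOException {
        stateChangeLock.lock();
        try {
          waitForAsyncReadToComplete(); // block until readAheadBuffer is ready
          ByteBuffer drained = activeBuffer;
          activeBuffer = readAheadBuffer; // freshly filled data becomes active
          readAheadBuffer = drained;
          readAheadBuffer.clear();
          triggerReadAhead(); // refill readAheadBuffer asynchronously
        } finally {
          stateChangeLock.unlock();
        }
      }

      protected abstract void waitForAsyncReadToComplete() throws IOException;

      protected abstract void triggerReadAhead();
    }
    ```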
    
    ## How was this patch tested?
    
    The issue was noticed as a regression in some workloads after upgrading to Spark 2.3.
    
    It was particularly visible on TPCDS Q95 running on instances with fast disks (AWS i3 instances).
    Running with profiling:
    * Spark 2.2: 5.2-5.3 minutes, 9.5% in LZ4BlockInputStream.read
    * Spark 2.3: 6.4-6.6 minutes, 31.1% in ReadAheadInputStream.read
    * Spark 2.3 + fix: 5.3-5.4 minutes, 13.3% in ReadAheadInputStream.read; very slightly slower, practically within noise
    
    We didn't see other regressions, and many workloads in general seem to be faster with Spark 2.3 (we have not investigated whether that is thanks to the async reader or unrelated).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/juliuszsompolski/apache-spark SPARK-23366

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20555
    
----
commit 987f15ccb01b6c0351fbfdd49d6930b929c50a74
Author: Juliusz Sompolski <julek@...>
Date:   2018-01-30T20:54:47Z

    locking tweak

commit b26ffce6780078dbc38bff658e1ef7e9c56c3dd8
Author: Juliusz Sompolski <julek@...>
Date:   2018-02-01T14:27:09Z

    fill the read ahead buffer

----

