[
https://issues.apache.org/jira/browse/HIVE-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310895#comment-16310895
]
Gopal V commented on HIVE-17573:
--------------------------------
[~kellyzly]: I'm using LLAP with ORC, loaded using the bin_flat tpc-h script in
hive-testbench.
https://github.com/hortonworks/hive-testbench/tree/hdp26/ddl-tpch/bin_flat
The hardware is {{Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz}} with 256GB RAM;
its NUMA organization is listed in the block further below.
The memory is split as a 128GB Xmx plus a 32GB cache for 24 executors, inside a
180GB container, which can hold pretty much the entire Q6 data in cache at the
1TB scale.
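As a rough check of that split, the arithmetic can be sketched as below. This
assumes the cache is counted in addition to the heap and that the heap is
averaged evenly over the executors, which is only an approximation; the
variable and function names are illustrative and not LLAP settings.

```python
# Figures taken from the comment above; names are illustrative only.
CONTAINER_GB = 180
HEAP_XMX_GB = 128
CACHE_GB = 32
EXECUTORS = 24


def memory_budget(container_gb, heap_gb, cache_gb, executors):
    """Return (headroom_gb, heap_per_executor_gb) for one daemon."""
    # Assumption: cache is allocated outside the Java heap.
    headroom_gb = container_gb - (heap_gb + cache_gb)
    # Assumption: heap is shared evenly, so this is only an average.
    heap_per_executor_gb = heap_gb / executors
    return headroom_gb, heap_per_executor_gb


headroom, per_executor = memory_budget(
    CONTAINER_GB, HEAP_XMX_GB, CACHE_GB, EXECUTORS
)
print(f"headroom outside heap and cache: {headroom} GB")
print(f"average heap per executor: {per_executor:.2f} GB")
```

Under those assumptions the container keeps about 20GB outside heap and cache,
and each executor averages a little over 5GB of heap.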
If you have the text-cache enabled (this takes multiple flags), you might be
able to get similar performance from the text data as well, but the significant
ORC speedup comes from loading data into lineitem in a natural order (the
production-like ingest results in one file per day).
{code}
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 131037 MB
node 0 free: 127359 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 131072 MB
node 1 free: 127987 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
{code}
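For anyone comparing their own host against this layout, a minimal sketch of
reading such a listing is shown here. The sample text is copied from the block
above; the {{parse_numa}} helper and its field names are hypothetical and only
handle the per-node cpus/size/free lines.

```python
import re

# Sample text copied from the NUMA listing above (distance table omitted).
NUMA_TEXT = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 131037 MB
node 0 free: 127359 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 131072 MB
node 1 free: 127987 MB
"""


def parse_numa(text):
    """Map node id -> {'cpus': [...], 'size_mb': int, 'free_mb': int}."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) (cpus|size|free): (.*)", line)
        if not m:
            continue  # skips the 'available' header and any other line
        node, field, value = int(m.group(1)), m.group(2), m.group(3)
        entry = nodes.setdefault(node, {})
        if field == "cpus":
            entry["cpus"] = [int(c) for c in value.split()]
        else:
            # Values look like '131037 MB'; keep the number only.
            entry[f"{field}_mb"] = int(value.split()[0])
    return nodes


nodes = parse_numa(NUMA_TEXT)
for node_id, info in sorted(nodes.items()):
    print(node_id, len(info["cpus"]), "cpus,", info["free_mb"], "MB free")
```

On the listing above this reports two nodes with 16 logical CPUs each.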
The setup uses a giant TLAB maximum so that the in-thread allocations go to the
same NUMA zone.
{{-XX:TLABSize=128m -XX:+ResizeTLAB -XX:+UseNUMA -XX:+AggressiveOpts
-XX:MetaspaceSize=1024m}}
JDK9 seems to wake up the producer-consumer pair on the same NUMA zone (the IO
elevator allocates the array and passes it to the executor thread, and the
executor passes it back instead of dropping the reference and leaving it to the
GC).
I'm not sure there's any actual movement on JEP-157, which would probably help
this thread-to-thread object passing much more.
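The hand-off described above (one thread allocates, the other consumes and
returns the same array) can be illustrated with a small sketch. This is not
LLAP code and it has no NUMA awareness at all; the function, queue names and
pool size are hypothetical, and it only shows buffers being recycled between
two threads rather than reallocated per batch.

```python
import queue
import threading


def run_pipeline(num_batches, pool_size=2, buf_len=4):
    """Pass buffers IO thread -> executor -> back.

    Returns (allocations, checksums).
    """
    free_buffers = queue.Queue()  # executor returns buffers here
    filled = queue.Queue()        # IO thread hands filled buffers here
    allocations = 0
    checksums = []

    # Allocate a fixed pool once, up front.
    for _ in range(pool_size):
        free_buffers.put(bytearray(buf_len))
        allocations += 1

    def io_thread():
        for batch in range(num_batches):
            buf = free_buffers.get()  # blocks until a buffer comes back
            for i in range(buf_len):
                buf[i] = (batch + i) % 256  # stand-in for reading data
            filled.put(buf)
        filled.put(None)  # end-of-stream marker

    def executor_thread():
        while True:
            buf = filled.get()
            if buf is None:
                break
            checksums.append(sum(buf))  # stand-in for query processing
            free_buffers.put(buf)       # hand the same array back

    threads = [
        threading.Thread(target=io_thread),
        threading.Thread(target=executor_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return allocations, checksums


allocations, checksums = run_pipeline(num_batches=5)
print("buffers allocated:", allocations)
print("batches processed:", len(checksums))
```

However many batches flow through, only the initial pool is ever allocated,
which is the property that keeps the arrays in one place.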
bq. From which tool, you can get above conclusion?
https://github.com/t3rmin4t0r/perf-map-agent/blob/jitdump/jit-objdump.sh
That's the script which I use to attach GDB to a running JIT process and
extract a JIT sample, with the additional CPU perf events.
Here's an example of the final report I gather from the JIT (this was sent to
Intel JDK team as a perf report, to see if they could fix {{public String(byte
ascii[], int hibyte, int offset, int count)}} to be faster for very small
strings).
http://people.apache.org/~gopalv/perf-29529.tbz2
This archive contains a perf event capture for Q6 on text data (instead of
ORC), recorded with
{code}
perf record -ag -e cycles,instructions,branch-misses,LLC-prefetch-misses,cache-misses,LLC-store-misses,LLC-load-misses
{code}
along with the JIT generated assembly.
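If the event list needs to be varied between runs, the same command line can be
assembled programmatically. The sketch below only builds and prints the
command; it does not run perf, and the helper name is hypothetical. The event
names are the ones from the command above and may not all be available on
every CPU.

```python
import shlex

# Event names copied from the perf record command above.
EVENTS = [
    "cycles",
    "instructions",
    "branch-misses",
    "LLC-prefetch-misses",
    "cache-misses",
    "LLC-store-misses",
    "LLC-load-misses",
]


def perf_record_command(events):
    """Build the argv list: system-wide (-a) with call graphs (-g)."""
    return ["perf", "record", "-ag", "-e", ",".join(events)]


cmd = perf_record_command(EVENTS)
print(shlex.join(cmd))
```

Passing the list form to a process launcher avoids any shell quoting issues if
the events are later read from configuration.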
If you're on an x86_64 machine, then I guess run-report.sh should work.
> LLAP: JDK9 support fixes
> ------------------------
>
> Key: HIVE-17573
> URL: https://issues.apache.org/jira/browse/HIVE-17573
> Project: Hive
> Issue Type: Bug
> Components: llap
> Affects Versions: 3.0.0
> Reporter: Gopal V
> Assignee: Gopal V
>
> The perf diff between JDK8 -> JDK9 seems to be significant.
> TPC-H Q6 on JDK8 takes 32s on a single node + 1 Tb scale warehouse.
> TPC-H Q6 on JDK9 takes 19s on the same host + same data.
> The performance difference seems to come from better JIT and better NUMA
> handling.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)