Hi Ogata,
On 09/20/2017 12:12 PM, Kazunori Ogata wrote:
Hi Peter,
The benchmark is GradientBoostingTree of Intel HiBench [1]. HiBench is a
suite of programs using Hadoop or Spark, and GradientBoostingTree is a
Spark program. The source code (in Scala) is [2]. To build the code, you
need Apache Spark.
The command line is equivalent to java -Xmx10g -Dspark.master="local[4]"
GradientBoostingTree <inputDir> 100, but what I actually use is a Java
program that calls the main method and measures its execution time using
System.currentTimeMillis().
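A timing wrapper of the kind described above might look roughly like the following. This is only a sketch, not the actual harness: the class name MainTimer, the Task interface, and the sleep used as a stand-in workload are all made up for illustration. In the real setup the lambda would call the benchmark's main method instead.

```java
// Hypothetical timing harness; every name here is illustrative only.
final class MainTimer {

    /** A unit of work that may throw, such as a call to some main method. */
    @FunctionalInterface
    interface Task {
        void run() throws Exception;
    }

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Task task) throws Exception {
        long start = System.currentTimeMillis();
        task.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload (assumption). The real harness would invoke the
        // benchmark's main method here with its own arguments.
        long elapsed = timeMillis(() -> Thread.sleep(100));
        System.out.println("elapsed: " + elapsed + " ms");
    }
}
```

A single wall-clock measurement like this includes JIT warm-up and class loading, so it is suitable for comparing whole runs but not for isolating a small code path.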
By the way, I'm running the benchmark on a POWER8 machine. Removing
volatile won't change the performance on x86.
[1] https://github.com/intel-hadoop/HiBench
[2] https://github.com/intel-hadoop/HiBench/blob/master/sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/GradientBoostingTree.scala
Regards,
Ogata
Huh, I thought it would be something easier to run. Am I right that the
improvement we are expecting comes from Java serialization and
deserialization of some data structure? If you could extract from the
benchmark just the approximate shape of the data structure and the
typical values it contains, I could create a JMH benchmark that tests
just that part, which would be better suited to tuning the serialization
code. Once the best variant is chosen, you could verify it by running
your test in your Spark setup. I think there is still room for
improvement. I have a few ideas I would like to test.
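To make the request concrete, the isolated part would be a plain serialization round trip like the sketch below. The Sample class is purely a placeholder, since the real shape of the data in the benchmark is exactly what is not yet known, and the class and method names are invented for illustration. JMH annotations are left out so that the code runs with the JDK alone. In a JMH benchmark, serialize and deserialize would be the bodies of the measured methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical round-trip sketch; not the benchmark's actual data or code.
final class SerializationRoundTrip {

    /** Placeholder payload (assumption): the real data shape is unknown. */
    static final class Sample implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;
        final double[] features;

        Sample(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /** Writes the object with standard Java serialization and returns the bytes. */
    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    /** Reads a single object back from bytes produced by serialize. */
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Sample original = new Sample("row-0", new double[] {1.0, 2.5, -3.0});
        Sample copy = (Sample) deserialize(serialize(original));
        System.out.println(copy.label + " " + Arrays.toString(copy.features));
    }
}
```

Replacing Sample with a class that mirrors the real structure and typical values is the part that needs input from the Spark side.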
Regards, Peter