zhengruifeng commented on issue #27461: [SPARK-30736][ML] One-Pass ChiSquareTest
URL: https://github.com/apache/spark/pull/27461#issuecomment-582752561

testCode:
```scala
import org.apache.spark.ml.clustering._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.ml.stat.ChiSquareTest

val df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count  // materialize the cache before timing

// Time 100 consecutive runs of the test (total wall-clock milliseconds).
val start = System.currentTimeMillis
Seq.range(0, 100).foreach { i => ChiSquareTest.test(df, "features", "label").head }
val end = System.currentTimeMillis
val dur = end - start
```

a9a: numFeatures=123, numInstances=32,561

result (total ms for the 100 runs):
this PR: 71520
master: 87407

Even with numFeatures < 1000, this PR is still faster than the existing implementation (about 18% less total time in this run), possibly because this PR parallelizes the computation of `ChiSqTestResult`, whereas the existing implementation needs to compute the results one by one on the driver.
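
To repeat the timing on both branches without copying the start/end bookkeeping, the loop could be wrapped in a small helper. This is only a sketch: `Bench` and `timeMillis` are hypothetical names, not part of Spark or of this PR, and the optional warm-up runs are an addition that the measurement above did not use.

```scala
// Hypothetical helper for illustration; not part of Spark or of the PR.
object Bench {
  /** Runs `body` `warmup` times untimed, then `runs` times timed.
    * Returns the total elapsed wall-clock time of the timed runs in milliseconds.
    */
  def timeMillis(runs: Int, warmup: Int = 0)(body: => Unit): Long = {
    require(runs > 0, "runs must be positive")
    var i = 0
    while (i < warmup) { body; i += 1 }   // untimed warm-up iterations
    val start = System.nanoTime()
    i = 0
    while (i < runs) { body; i += 1 }     // timed iterations
    (System.nanoTime() - start) / 1000000L
  }
}

// Usage in spark-shell, assuming `df` is loaded and cached as in the test code:
// val dur = Bench.timeMillis(runs = 100) {
//   ChiSquareTest.test(df, "features", "label").head
// }
```

With `warmup = 0` (the default) this measures the same thing as the loop in the test code; a non-zero warm-up would exclude JIT and first-job overhead from the total.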
