zhengruifeng commented on issue #27461: [SPARK-30736][ML] One-Pass ChiSquareTest
URL: https://github.com/apache/spark/pull/27461#issuecomment-582752561
 
 
   Test code:
   ```scala
   import org.apache.spark.storage.StorageLevel
   import org.apache.spark.ml.stat.ChiSquareTest

   // Load a9a and cache it so that file I/O is excluded from the timed loop.
   val df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
   df.persist(StorageLevel.MEMORY_AND_DISK)
   df.count

   // Time 100 consecutive runs of the test; `dur` is in milliseconds.
   val dur = {
     val start = System.currentTimeMillis
     Seq.range(0, 100).foreach { _ => ChiSquareTest.test(df, "features", "label").head }
     System.currentTimeMillis - start
   }
   ```
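   
   For repeated comparisons, the timing loop could be factored into a small helper. This is only a sketch: `timeMillis` and its `warmup` parameter are hypothetical names that do not appear in the PR, and the numbers reported in this comment were measured without any warm-up runs.

   ```scala
   // Hypothetical helper (not part of the PR): runs `body` `warmup` times
   // untimed, then `runs` times timed, and returns the elapsed milliseconds.
   def timeMillis(warmup: Int, runs: Int)(body: => Unit): Long = {
     (0 until warmup).foreach(_ => body)
     val start = System.nanoTime()
     (0 until runs).foreach(_ => body)
     (System.nanoTime() - start) / 1000000L
   }

   // Possible usage in spark-shell, assuming `df` is loaded as in the benchmark:
   // val dur = timeMillis(warmup = 5, runs = 100) {
   //   ChiSquareTest.test(df, "features", "label").head
   // }
   ```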
   
   a9a: numFeatures=123, numInstances=32,561
   
   Result (total milliseconds for the 100 runs):
   this PR: 71520
   master: 87407
   
   Even when numFeatures < 1000, this PR is still faster than the existing implementation (about 18% less total time here), maybe because this PR parallelizes the computation of `ChiSqTestResult`, while the existing implementation needs to compute them one by one on the driver.
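   
   To illustrate why the per-feature results can be computed independently, here is a minimal plain-Scala sketch of the Pearson chi-square statistic for a single contingency table. It is not the code in this PR: `chiSquareStatistic` and the sample `tables` are made up for illustration, and the sketch ignores degrees of freedom and p-values.

   ```scala
   // Illustrative only (not the PR's code): Pearson chi-square statistic for one
   // feature's contingency table, rows = feature values, columns = labels.
   def chiSquareStatistic(observed: Array[Array[Double]]): Double = {
     val rowSums = observed.map(_.sum)
     val colSums = observed.transpose.map(_.sum)
     val total = rowSums.sum
     val terms = for {
       i <- observed.indices
       j <- observed(i).indices
       expected = rowSums(i) * colSums(j) / total
       if expected > 0
     } yield {
       val diff = observed(i)(j) - expected
       diff * diff / expected
     }
     terms.sum
   }

   // Made-up tables, one per feature. Each statistic depends only on its own
   // table, so this map could run on executors rather than in a driver-side loop.
   val tables: Seq[Array[Array[Double]]] = Seq(
     Array(Array(10.0, 20.0), Array(20.0, 10.0)),
     Array(Array(15.0, 15.0), Array(15.0, 15.0))
   )
   val statistics = tables.map(chiSquareStatistic)
   ```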
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
