srowen commented on a change in pull request #23983: [SPARK-26881][mllib]
Heuristic for tree aggregate depth
URL: https://github.com/apache/spark/pull/23983#discussion_r267007403
##########
File path:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
##########
@@ -775,6 +778,27 @@ class RowMatrix @Since("1.0.0") (
      s"The number of rows $m is different from what specified or previously computed: ${nRows}.")
}
}
+
+  /**
+   * Computes the desired tree aggregate depth necessary to avoid exceeding
+   * driver.MaxResultSize during aggregation.
+   * Based on the formula: (numPartitions)^(1/depth) * objectSize <= DriverMaxResultSize
+   * @param aggregatedObjectSizeInMb the size, in megabytes, of the object being tree aggregated
+   */
+  private[spark] def getTreeAggregateIdealDepth(aggregatedObjectSizeInMb: Int) = {
+    val maxDriverResultSizeInMb = rows.conf.get[Long](MAX_RESULT_SIZE) / (1024 * 1024)
Review comment:
Sorry to pick on this, but what about dealing in bytes here, not MB? I think
we might have a problem if the aggregatedObjectSize is so small that it rounds
down to 0 MB and then below you take the log of 0.
I apologize for only thinking about this now, but I think we have a problem
when the object size is nearly equal to the max. The desired depth could be
really big, like 1000 or more.
Indeed, the denominator can be 0 or negative. I suspect we don't want to fail
in that case but just fall back to a maximum depth there too.
How about capping the depth between 1 and, say, 10 to be safe? As a
heuristic, I don't think depths larger than that are reasonable anyway; use 10
if the denominator is <= 0.
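Putting the suggestions together, the heuristic could look something like the sketch below. This is illustrative, not the PR's actual code: the helper name and the byte-based parameters are assumptions, chosen to show the bytes-not-MB handling and the [1, 10] cap discussed above.

```scala
// Sketch of the suggested heuristic (illustrative, not the PR's final code).
// Solving numPartitions^(1/depth) * objectSize <= maxResultSize for depth
// gives: depth >= log(numPartitions) / log(maxResultSize / objectSize).
// Working in bytes avoids a small object rounding down to 0 MB (and a log of
// 0); the result is clamped to [1, 10], using 10 when the denominator is <= 0.
object TreeAggregateDepth {
  def treeAggregateIdealDepth(numPartitions: Int,
                              objectSizeInBytes: Long,
                              maxResultSizeInBytes: Long): Int = {
    val maxDepth = 10
    val denominator =
      math.log(maxResultSizeInBytes.toDouble / objectSizeInBytes.toDouble)
    if (denominator <= 0.0) {
      maxDepth // object size is at or above the limit: fall back to max depth
    } else {
      val depth =
        math.ceil(math.log(numPartitions.toDouble) / denominator).toInt
      math.min(maxDepth, math.max(1, depth))
    }
  }
}
```

With a 1-byte object and a 1 GB limit the depth collapses to 1; with the object size equal to the limit the denominator is 0 and the cap of 10 kicks in, instead of a failure or a depth in the thousands.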