Reynold Xin created SPARK-18853:
-----------------------------------
Summary: Project is way too aggressive in estimating statistics
Key: SPARK-18853
URL: https://issues.apache.org/jira/browse/SPARK-18853
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Reynold Xin
We currently define statistics in UnaryNode:
{code}
override def statistics: Statistics = {
  // There should be some overhead in Row object, the size should not be zero when there is
  // no columns, this help to prevent divide-by-zero error.
  val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
  val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
  // Assume there will be the same number of rows as child has.
  var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
  if (sizeInBytes == 0) {
    // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
    // (product of children).
    sizeInBytes = 1
  }
  child.statistics.copy(sizeInBytes = sizeInBytes)
}
{code}
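To make the failure mode concrete, here is a worked example of the scaling formula above. The schema and sizes are hypothetical, and the arithmetic assumes the current defaultSize behavior, where every array is estimated at 100 elements:
{code}
// Hypothetical child schema: a: array<array<int>>, b: bigint.
object ProjectEstimateDemo {
  def main(args: Array[String]): Unit = {
    // ArrayType.defaultSize assumes 100 elements per array, so
    // array<array<int>> is estimated at 100 * (100 * 4) = 40000 bytes.
    val childRowSize  = 100 * (100 * 4) + 8 + 8  // nested array + bigint + row overhead
    val outputRowSize = 8 + 8                    // Project keeps only b
    val childSize     = BigInt(1) << 30          // pretend the child is 1 GiB
    val estimate = childSize * outputRowSize / childRowSize
    println(s"estimate: $estimate bytes")        // ~429 KB, a ~2500x reduction
    // If the arrays really hold ~2 elements each, the true row size is only
    // 2 * (2 * 4) + 16 = 32 bytes, so the real output is closer to 512 MB:
    // the estimate is off by roughly three orders of magnitude.
  }
}
{code}
Underestimates of this magnitude matter because sizeInBytes feeds planner decisions such as whether a relation fits under spark.sql.autoBroadcastJoinThreshold, so a Project over nested data can, for example, be wrongly chosen as the broadcast side of a join.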
This has two issues:
1. It can aggressively underestimate the size for Project. defaultSize assumes each array/map has 100 elements, which is an overestimate; this inflates childRowSize, so when the user projects a single field out of a deeply nested column, the output-to-child row-size ratio becomes tiny and the resulting estimate is far too small. A safer, sane default is probably 2 elements.
2. Propagating statistics this way is not a property of every UnaryNode; it should be a property of Project specifically. A sketch of that direction follows.
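A minimal sketch of what point 2 suggests, assuming the fix keeps the same scaling formula but moves it from UnaryNode onto Project. This is written against the Spark 2.1-era LogicalPlan API (where statistics is a method on the plan) and is illustrative, not the actual patch:
{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Statistics, UnaryNode}

// Illustrative only: scale sizeInBytes by the row-size ratio in Project
// itself, so plain UnaryNodes can simply pass child statistics through.
case class ProjectSketch(projectList: Seq[NamedExpression], child: LogicalPlan)
  extends UnaryNode {

  override def output: Seq[Attribute] = projectList.map(_.toAttribute)

  override def statistics: Statistics = {
    // The + 8 models per-row overhead and avoids divide-by-zero on empty schemas.
    val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
    val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
    val sizeInBytes =
      (child.statistics.sizeInBytes * outputRowSize) / childRowSize
    // Keep sizeInBytes positive so products over children never collapse to zero.
    child.statistics.copy(sizeInBytes = sizeInBytes max 1)
  }
}
{code}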