[
https://issues.apache.org/jira/browse/KUDU-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Wong updated KUDU-3153:
------------------------------
Description:
We sometimes see a mismatch when viewing the rowset layout diagram: the
quartiles indicate well-compacted (32MB-sized) rowsets, while the compaction
policy dump shows all rowsets as very much undersized (around 10MB).
Looking through what's used where, a snippet from the patch for KUDU-2701
indicates that the policy considers only base data and REDO file sizes,
excluding the PK index and bloom filters:
{quote}It's not totally clear to me why just base data and REDOs are used, ...
{quote}
After some spelunking, it seems like the usage of base data + redo file size
stems from a time when DiskRowSet didn't have an interface to get the full size
of the DRS, as seen in an [older version of
RowSetInfo|https://github.com/apache/kudu/blame/6a12ba3f7d66dcf748e8864aae8139813c1c4746/src/kudu/tablet/rowset_info.cc#L256]
and the [corresponding version of
diskrowset.h|https://github.com/apache/kudu/blob/6a12ba3f7d66dcf748e8864aae8139813c1c4746/src/kudu/tablet/diskrowset.h#L335].
We should probably consider using the full size of the DRSs. I suspect that
would give us more useful estimates of the efficacy of a compaction,
especially in the context of a "small rowset" compaction (see KUDU-1400).
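To make the discrepancy concrete, here is a minimal sketch of the two size
calculations being contrasted. The struct, field names, and functions are
hypothetical and are not Kudu's actual DiskRowSet or RowSetInfo API. It also
assumes the "full" size is just base data + REDOs + PK index + bloom filters,
which may leave out other on-disk components.

```cpp
#include <cstdint>

// Hypothetical per-rowset size breakdown, for illustration only.
// These names are assumptions, not fields from Kudu's DiskRowSet.
struct RowSetSizes {
  uint64_t base_data_bytes;  // base column data
  uint64_t redo_bytes;       // REDO delta files
  uint64_t pk_index_bytes;   // primary key index
  uint64_t bloom_bytes;      // bloom filters
};

// Size as the compaction policy is described as seeing it:
// base data and REDOs only.
inline uint64_t PolicySize(const RowSetSizes& s) {
  return s.base_data_bytes + s.redo_bytes;
}

// Size with the PK index and bloom filters included as well
// (assumed here to be the only other components).
inline uint64_t FullSize(const RowSetSizes& s) {
  return PolicySize(s) + s.pk_index_bytes + s.bloom_bytes;
}

// True if a rowset of the given size falls short of the target size.
inline bool LooksUndersized(uint64_t size_bytes, uint64_t target_bytes) {
  return size_bytes < target_bytes;
}
```

With made-up numbers where the PK index and bloom filters account for a large
share of a rowset, LooksUndersized(PolicySize(s), target) can be true while
LooksUndersized(FullSize(s), target) is false for the same rowset, which is
the kind of disagreement between the layout diagram and the policy dump
described above.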
> Use full DRS size when considering rowsets to compact
> -----------------------------------------------------
>
> Key: KUDU-3153
> URL: https://issues.apache.org/jira/browse/KUDU-3153
> Project: Kudu
> Issue Type: Bug
> Components: compaction, tserver
> Reporter: Andrew Wong
> Priority: Major
> Attachments: Screen Shot 2020-06-19 at 5.06.19 PM.png