[ https://issues.apache.org/jira/browse/SOLR-13512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854890#comment-16854890 ]
Andrzej Bialecki commented on SOLR-13512:
------------------------------------------
This patch contains an {{IndexSizeEstimator}} tool, which works both as a
command-line utility and as a Solr component, and provides the functionality
described in the issue. The patch also extends
{{/admin/<collection>/segments}} and {{/admin/collections?action=COLSTATUS}} to
efficiently report this data from the live {{IndexReader}}s of a collection.
Here's an example output of COLSTATUS that contains just the overview section
(the collection contains a partial Wikipedia dump):
{code}
curl 'http://localhost:8983/solr/admin/collections?action=COLSTATUS&rawSizeInfo=true'
{
  "responseHeader": {
    "status": 0,
    "QTime": 49406
  },
  "gettingstarted": {
    ...
    "shards": {
      "shard1": {
        ...
        "leader": {
          ...
          "segInfos": {
            ...
            "rawSize": {
              "fieldsBySize": {
                "761 MB": "revision.text",
                "88.7 MB": "revision.text_str",
                "29.4 MB": "revision",
                "26.4 MB": "revision.sha1",
                "24.8 MB": "revision.comment",
                "18.9 MB": "revision.comment_str",
                "13.5 MB": "title",
                "12.5 MB": "revision.contributor",
                "11.9 MB": "revision.sha1_str",
                "9.2 MB": "revision.timestamp",
                "8.8 MB": "revision.contributor.id",
                "7.3 MB": "revision.format",
                "7.1 MB": "id",
                "6.8 MB": "revision.parentid",
                "6.3 MB": "revision.contributor.username",
                "6.1 MB": "revision.model",
                "4.6 MB": "title_str",
                "4.3 MB": "revision.format_str",
                "3.8 MB": "revision.contributor.username_str",
                "3.1 MB": "_version_",
                "2.8 MB": "revision.model_str",
                "2.7 MB": "revision.contributor_str",
                ...
              }
            }
          }
        }
      },
      "shard2": {
        ...
        "leader": {
          ...
          "segInfos": {
            ...
            "rawSize": {
              "fieldsBySize": {
                "769.4 MB": "revision.text",
                "89.2 MB": "revision.text_str",
                "31.2 MB": "revision",
                "28 MB": "revision.sha1",
                "26.4 MB": "revision.comment",
                "20.7 MB": "revision.comment_str",
                "14.2 MB": "title",
                "13.3 MB": "revision.contributor",
                "12.6 MB": "revision.sha1_str",
                "9.8 MB": "revision.timestamp",
                "9.4 MB": "revision.contributor.id",
                "7.7 MB": "revision.format",
                "7.6 MB": "id",
                "6.9 MB": "revision.parentid",
                "6.7 MB": "revision.contributor.username",
                "6.5 MB": "revision.model",
                "4.7 MB": "title_str",
                "4.5 MB": "revision.format_str",
                "3.9 MB": "revision.contributor.username_str",
                "3.3 MB": "_version_",
                "2.9 MB": "revision.contributor_str",
                ...
              }
            }
          }
        }
      }
    }
  }
}
{code}
I've attached outputs from the command: one provides a summary breakdown per
type of data, and the other a much more detailed breakdown that includes
per-field and per-type statistical summaries.
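
For anyone who wants per-field totals for a whole collection, the COLSTATUS response shown above can be post-processed on the client side. The following is only a sketch, not part of the patch: the helper names {{parse_size}} and {{total_field_sizes}} are made up for illustration, the unit table assumes binary multiples (1 MB = 1024 * 1024 bytes), and the response is assumed to be already parsed from JSON into a dict with the nesting seen in the example.

```python
from collections import defaultdict

# Assumption: size strings use binary multiples (1 KB = 1024 bytes).
UNITS = {"bytes": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}


def parse_size(text):
    """Convert a string such as '88.7 MB' into a number of bytes."""
    number, unit = text.split()
    return float(number) * UNITS[unit]


def total_field_sizes(response, collection):
    """Sum each field's leader 'fieldsBySize' entry across all shards.

    `response` is the parsed COLSTATUS JSON; the path below mirrors the
    nesting in the example output (shards -> leader -> segInfos -> rawSize).
    """
    totals = defaultdict(float)
    for shard in response[collection]["shards"].values():
        by_size = shard["leader"]["segInfos"]["rawSize"]["fieldsBySize"]
        # Note the orientation: keys are sizes, values are field names.
        for size, field in by_size.items():
            totals[field] += parse_size(size)
    return dict(totals)
```

One caveat with this approach: because {{fieldsBySize}} is keyed by the formatted size, two fields in the same shard that happen to have an identical size string would collapse into a single entry once the JSON is parsed into a dict, so the totals should be treated as approximate.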
> Raw index data analysis tool
> ----------------------------
>
> Key: SOLR-13512
> URL: https://issues.apache.org/jira/browse/SOLR-13512
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Priority: Major
> Attachments: SOLR-13512.patch
>
>
> A common question from Solr users is how much a given schema field, together
> with all its related index data, contributes to the total index size.
> It's possible to estimate this information by doing a single full pass
> through all index data, aggregating estimated sizes of terms, postings, doc
> values and stored fields. The totals of course represent the worst-case
> scenario, in which there's no index compression at all, but they should still
> be useful for answering the question above.
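
The single-pass aggregation described in the issue can be illustrated with a toy model. This is not the patch's implementation and does not use Lucene at all: documents are plain dicts, numeric values are treated as doc values, everything else is treated as whitespace-tokenized stored text, and the per-entry byte costs ({{POSTING_BYTES}}, {{DOC_VALUE_BYTES}}) are assumptions picked only to show how uncompressed sizes add up per field and per data type.

```python
from collections import defaultdict

# Assumed per-entry costs for this toy model (not Lucene's real numbers).
POSTING_BYTES = 4    # one uncompressed doc ID per (term, document) pair
DOC_VALUE_BYTES = 8  # one uncompressed 64-bit value per numeric field value


def estimate_raw_sizes(docs):
    """Single pass over docs; returns {field: {data_type: estimated_bytes}}."""
    sizes = defaultdict(lambda: defaultdict(int))
    seen_terms = defaultdict(set)  # field -> distinct terms seen so far
    for doc in docs:
        for field, value in doc.items():
            if isinstance(value, (int, float)):
                # Numeric values are counted as doc values only.
                sizes[field]["docValues"] += DOC_VALUE_BYTES
                continue
            text = str(value)
            sizes[field]["storedFields"] += len(text.encode("utf-8"))
            for term in set(text.lower().split()):
                # Every (term, document) pair costs one posting entry.
                sizes[field]["postings"] += POSTING_BYTES
                # Each distinct term is counted once in the term dictionary.
                if term not in seen_terms[field]:
                    seen_terms[field].add(term)
                    sizes[field]["terms"] += len(term.encode("utf-8"))
    return {field: dict(by_type) for field, by_type in sizes.items()}
```

Since nothing here is compressed or shared between entries, the totals behave like the worst-case figures the issue mentions: they overstate the on-disk size but still show which fields and data types dominate.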