[
https://issues.apache.org/jira/browse/LUCENE-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113611#comment-13113611
]
Christopher Ball edited comment on LUCENE-3435 at 9/24/11 2:15 AM:
-------------------------------------------------------------------
Grant - Great start =)
Below is some initial feedback (happy to help further if you want to chat in
real-time)
*Quickly Grokking* - To make the spreadsheet easier to comprehend at a glance,
the cells that are to be updated should be color coded (as opposed to those
that are calculated)
*Bytes or Entries* - You list Max Size for filterCache, queryResultCache, and
documentCache as 512, which subtly implies the size is measured in bytes when
the unit is actually the number of entries. I would clarify the unit of
measure in the spreadsheet (I've seen numerous blogs and emails confuse this).
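For illustration only, a minimal Java sketch of why the unit matters: if the
limit is an entry count, the memory footprint is roughly entries x average
entry size, and every per-entry size below is an assumed figure, not a
measurement.
{code:java}
// Back-of-the-envelope only: cache "size" limits are entry counts, so the
// footprint is roughly maxEntries * (avg key bytes + avg value bytes).
public class CacheFootprint {
    static long approxBytes(int maxEntries, long avgKeyBytes, long avgValueBytes) {
        return (long) maxEntries * (avgKeyBytes + avgValueBytes);
    }

    public static void main(String[] args) {
        // Assumed sizes: a filterCache value is a bitset of ~maxDoc/8 bytes,
        // so an 8M-doc index costs ~1 MB per cached filter; the 100-byte key
        // size is a guess.
        System.out.printf("filterCache ~ %,d bytes%n",
                approxBytes(512, 100, 1000000));
    }
}
{code}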
*Approach to Cache Sizing* - Given that memory requirements are heavily
contingent on caching, I would suggest including at least one approach for
how to determine each cache's size (a small worked sketch follows the list
below):
* Query Result Cache
** Estimation: should be greater than 'number of commonly recurring unique
queries' x 'number of sort parameters' x 'number of possible sort orders'
* Document Cache
** Estimation: should be greater than 'maximum number of documents per query' x
'maximum number of concurrent queries'
* Filter Cache
** Estimation: should be the number of unique filter queries (should clarify
what constitutes 'unique')
* Field Value Cache
** Estimation: should be ?
* Custom Caches
** Estimation: should be ? - A common use case?
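To make the formulas above concrete, here is a minimal sketch that turns them
into entry counts; every input is a hypothetical workload figure that would
have to be measured per deployment.
{code:java}
// Sketch of the entry-count estimates above; inputs are invented examples.
public class CacheSizeEstimator {
    public static void main(String[] args) {
        // queryResultCache >= common unique queries x sort params x sort orders
        int commonUniqueQueries = 2000;
        int sortParameters = 3;
        int sortOrders = 2; // ascending + descending
        System.out.println("queryResultCache >= "
                + commonUniqueQueries * sortParameters * sortOrders);

        // documentCache >= max docs per query x max concurrent queries
        int maxDocsPerQuery = 50;     // e.g. rows=50
        int maxConcurrentQueries = 40;
        System.out.println("documentCache >= "
                + maxDocsPerQuery * maxConcurrentQueries);
    }
}
{code}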
*Faceting* - Surprised there is no reference to faceting, which is both
increasingly common as default query functionality and a significant driver
of additional memory requirements
*Obscure Metrics* - To really give this spreadsheet some teeth, there should
be pointers to at least one approach for estimating each input metric (could
be on another tab); a sketch for the easier ones follows the lists below.
* Some are fairly easy:
** Number of Unique Terms / field
** Number of documents
** Number of indexed fields (no norms)
** Number of fields w/ norms
** Number of non-String Sort Fields other than score
** Number of String Sort Fields
** Number of deleted docs on avg
** Avg. number of terms per query
* Some are quite obscure (and guidance on how to estimate them is essential):
** Number of RAM-based Column Stride Fields (DocValues)
** ramBufferSizeMB
** Transient Factor (MB)
** fieldValueCache Max Size
** Custom Cache Size (MB)
** Avg. Number of Bytes per Term
** Bytes/Term
** Field Cache bits/term
** Cache Key Avg. Size (Bytes)
** Avg QueryResultKey size (in bytes)
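For the 'fairly easy' metrics, one possible starting point is to read them
straight off the index. A sketch only, assuming a Lucene 4.x-style API and a
hypothetical 'body' field (names vary across versions):
{code:java}
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.FSDirectory;

public class IndexMetrics {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(new File(args[0])));
        try {
            System.out.println("numDocs      = " + reader.numDocs());
            System.out.println("deletedDocs  = " + reader.numDeletedDocs());

            // Unique terms for one (hypothetical) field; size() returns -1
            // when the codec cannot report the count cheaply.
            Terms terms = MultiFields.getTerms(reader, "body");
            System.out.println("unique terms = "
                    + (terms == null ? 0 : terms.size()));

            // Indexed fields, and how many of them carry norms.
            FieldInfos infos = MultiFields.getMergedFieldInfos(reader);
            int indexed = 0, withNorms = 0;
            for (FieldInfo fi : infos) {
                if (fi.isIndexed()) indexed++;
                if (fi.hasNorms()) withNorms++;
            }
            System.out.println("indexed fields = " + indexed
                    + ", with norms = " + withNorms);
        } finally {
            reader.close();
        }
    }
}
{code}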
> Create a Size Estimator model for Lucene and Solr
> -------------------------------------------------
>
> Key: LUCENE-3435
> URL: https://issues.apache.org/jira/browse/LUCENE-3435
> Project: Lucene - Java
> Issue Type: Task
> Components: core/other
> Affects Versions: 4.0
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
>
> It is often handy to be able to estimate the amount of memory and disk space
> that both Lucene and Solr use, given certain assumptions. I intend to check
> in an Excel spreadsheet that allows people to estimate memory and disk usage
> for trunk. I propose to put it under dev-tools, as I don't think it should
> be official documentation just yet; like the IDE stuff, we'll see how well
> it gets maintained.