[jira] [Commented] (SOLR-4589) 4.x + enableLazyFieldLoading + large multivalued fields + varying fl = pathological CPU load & response time

Hoss Man (JIRA) Fri, 15 Mar 2013 14:04:13 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603845#comment-13603845
 ]


Hoss Man commented on SOLR-4589:
--------------------------------

bq. Lucene natively no longer has support for lazy field loading, but there is 
a "backwards layer" just for Solr in modules/misc (LazyDocument.java)

Yeah .. LazyDocument is seriously evil...

bq. The document does not use maps to lookup, if you have many fields its 
always a scan through the ArrayList of all fields in the document.

It's worse than that though -- having many fields and scanning through them all 
to fetch a single field value (or an array of field values for a single name) 
is a cost that has to be paid in 4.x regardless of whether you are using 
LazyDocument or not.  The root problem here seems to be having fields that 
contain many _values_.  That problem is exacerbated by the fact that, unlike 3.x 
lazy loading, 4.x LazyDocument/LazyField doesn't do anything to "cache" the 
fields you've already asked for.
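
To make the scan cost concrete, here is a minimal sketch of a document that 
keeps its fields in a flat list.  It is a simplified model, not the Lucene 
source: {{FlatDoc}}, {{entriesVisited}} and {{FlatDocDemo}} are hypothetical 
names used only for illustration.  The point is that fetching even a 
single-valued field walks past every stored value of every other field.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (NOT Lucene code): a document that stores its fields in a
// flat list, with one entry per stored value.
final class FlatDoc {
    // Each entry is {name, value}; a multivalued field contributes many entries.
    private final List<String[]> fields = new ArrayList<>();

    // Counts how many list entries lookups have inspected so far.
    long entriesVisited = 0;

    void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    // Linear scan over every entry, whatever field it belongs to.
    String[] getValues(String name) {
        List<String> out = new ArrayList<>();
        for (String[] f : fields) {
            entriesVisited++;
            if (f[0].equals(name)) {
                out.add(f[1]);
            }
        }
        return out.toArray(new String[0]);
    }
}

final class FlatDocDemo {
    public static void main(String[] args) {
        FlatDoc doc = new FlatDoc();
        doc.add("id", "1");
        for (int i = 0; i < 10_000; i++) {
            doc.add("foo", "v" + i);
        }
        String[] id = doc.getValues("id");
        System.out.println(id.length + " value(s) found, "
                + doc.entriesVisited + " entries visited");
    }
}
```

In this model, asking for the single "id" value still visits all 10,001 
entries, because the 10,000 values of "foo" sit in the same list.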

Below are my notes from investigating this and trying to get up to speed on the 
new world order of document loading w/o FieldSelector.  I'll experiment with 
some fixes after i get some food...

----


LUCENE-2308 - r1162347
* IndexReader.doc(int,FieldSelector) deleted
* FieldSelector moved to misc
* new concept StoredFieldVisitor introduced
** void IndexReader.document(int docID, StoredFieldVisitor visitor)
* new impl DocumentStoredFieldVisitor extends StoredFieldVisitor
* new impl FieldSelectorVisitor extends StoredFieldVisitor 
** appears all the old FieldSelector logic from IndexReader moved here?
** contains a private "LazyField extends Field" that caches field values once 
fetched
* SolrIndexSearcher modified to use FieldSelectorVisitor

LUCENE-2621 - r1199779
* eliminates FieldSelector & FieldSelectorVisitor
* leaves StoredFieldVisitor & DocumentStoredFieldVisitor intact
* introduced public LazyDocument containing "LazyField implements 
IndexableField"
** this version of LazyField does _not_ cache any data once fetched
* changes SolrIndexSearcher's SetNonLazyFieldSelector to extend 
StoredFieldVisitor
** adds LazyField to the Document for any fields not immediately needed
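
As a rough illustration of the visitor idea in the notes above, here is a 
self-contained sketch.  It is a stand-in model, not the Lucene API: the real 
StoredFieldVisitor works with FieldInfo objects and has more callbacks, while 
{{FieldVisitor}}, {{StoredFieldsModel}} and {{EagerOrLazyVisitor}} are 
hypothetical names with simplified signatures.  The visitor loads the fields 
it was asked for and only counts one placeholder per skipped value, which is 
roughly the role the notes describe for SetNonLazyFieldSelector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative stand-ins only; names and signatures are assumptions.
enum Status { YES, NO, STOP }

interface FieldVisitor {
    Status needsField(String name);
    void stringField(String name, String value);
}

// Stand-in for a stored-fields reader: offers each stored value to the visitor.
final class StoredFieldsModel {
    private final List<String[]> stored = new ArrayList<>();

    void add(String name, String value) {
        stored.add(new String[] {name, value});
    }

    void document(FieldVisitor visitor) {
        for (String[] f : stored) {
            Status s = visitor.needsField(f[0]);
            if (s == Status.STOP) {
                return;
            }
            if (s == Status.YES) {
                visitor.stringField(f[0], f[1]);
            }
        }
    }
}

// Loads the wanted fields eagerly; for every other value it only records
// that a lazy placeholder would be needed (one per value).
final class EagerOrLazyVisitor implements FieldVisitor {
    private final Set<String> wanted;
    final Map<String, List<String>> eager = new LinkedHashMap<>();
    final Map<String, Integer> lazyCounts = new LinkedHashMap<>();

    EagerOrLazyVisitor(Set<String> wanted) {
        this.wanted = wanted;
    }

    @Override
    public Status needsField(String name) {
        if (wanted.contains(name)) {
            return Status.YES;
        }
        lazyCounts.merge(name, 1, Integer::sum);
        return Status.NO;
    }

    @Override
    public void stringField(String name, String value) {
        eager.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
}

final class VisitorDemo {
    public static void main(String[] args) {
        StoredFieldsModel reader = new StoredFieldsModel();
        reader.add("id", "1");
        for (int i = 0; i < 5; i++) {
            reader.add("foo", "v" + i);
        }
        EagerOrLazyVisitor v = new EagerOrLazyVisitor(Set.of("id"));
        reader.document(v);
        System.out.println("eager=" + v.eager + " lazy placeholders=" + v.lazyCounts);
    }
}
```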


The crux of the problem is that:
* LazyDocument is lazy about loading the doc, but once you ask for the value 
of any LazyField, the entire Document (with all underlying IndexableField 
values) is loaded.
* even though the entire document has been loaded once a single LazyField is 
used, the performance of iterating over LazyFields is *TERRIBLE* when there 
are lots of values for a single field
* requests for the value of individual LazyFields are not cached/stored 
anywhere, so the poor performance affects all subsequent re-uses of the same 
LazyDocuments

Details...

The state tracked in a LazyField is a reference back to the underlying 
LazyDocument, the field name, and the "num" offset of this IndexableField in 
the list of values for that field name.  When you ask the LazyField for its 
value, it asks the underlying LazyDocument to fetch the entire Document (if it 
hasn't already), then it asks that Document for _all_ values of the 
associated field name as an array, and then it looks up its "num" offset in 
that array.

So if you build up an (outer) Document containing N LazyField instances for 
a field named "foo" (as is done in Solr's SetNonLazyFieldSelector), and then 
try to iterate over the values with something like {{String[] values = 
outerDoc.getValues("foo");}}, then under the covers LazyField will load every 
value of every field of that document into memory as an "innerDoc".  That 
innerDoc will then be asked N times to generate a new IndexableField[] of every 
value of field "foo" (which BTW involves iterating over every IndexableField 
value of every field), and each time N-1 elements of that array will be 
ignored and thrown away.
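
The access pattern described above can be modelled in a few lines.  This is a 
sketch under stated assumptions, not the actual LazyDocument code: 
{{InnerDoc}}, {{LazyDocModel}}, {{UncachedLazyField}} and {{QuadraticDemo}} 
are hypothetical classes that only mimic the behaviour (load the whole doc 
once, then rescan it for every field value requested).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the behaviour described above; NOT the Lucene classes.
final class InnerDoc {
    private final List<String[]> fields = new ArrayList<>(); // {name, value}
    long entriesVisited = 0;

    void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    // Builds a fresh array on every call by scanning all entries.
    String[] getValues(String name) {
        List<String> out = new ArrayList<>();
        for (String[] f : fields) {
            entriesVisited++;
            if (f[0].equals(name)) {
                out.add(f[1]);
            }
        }
        return out.toArray(new String[0]);
    }
}

final class LazyDocModel {
    private final InnerDoc source; // stands in for the stored fields on disk
    private InnerDoc loaded;       // the "innerDoc", loaded at most once
    int loads = 0;

    LazyDocModel(InnerDoc source) {
        this.source = source;
    }

    InnerDoc getDocument() {
        if (loaded == null) {
            loaded = source;
            loads++;
        }
        return loaded;
    }
}

final class UncachedLazyField {
    private final LazyDocModel parent;
    private final String name;
    private final int num; // offset of this value among the values of `name`

    UncachedLazyField(LazyDocModel parent, String name, int num) {
        this.parent = parent;
        this.name = name;
        this.num = num;
    }

    // Fetches ALL values for the name, keeps one, discards the rest.
    String stringValue() {
        return parent.getDocument().getValues(name)[num];
    }
}

final class QuadraticDemo {
    // Entries visited when reading n lazy values of "foo" exactly once each.
    static long visitsFor(int n) {
        InnerDoc inner = new InnerDoc();
        inner.add("id", "1");
        for (int i = 0; i < n; i++) {
            inner.add("foo", "v" + i);
        }
        LazyDocModel lazy = new LazyDocModel(inner);
        List<UncachedLazyField> outer = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            outer.add(new UncachedLazyField(lazy, "foo", i));
        }
        for (UncachedLazyField f : outer) {
            f.stringValue();
        }
        return inner.entriesVisited;
    }

    public static void main(String[] args) {
        System.out.println("N=1000: " + visitsFor(1000) + " entries visited");
        System.out.println("N=2000: " + visitsFor(2000) + " entries visited");
    }
}
```

In this model the work is N * (N + 1) entry visits for N values plus one "id" 
field, so doubling N roughly quadruples the cost, and each of the N calls also 
allocates an N-element array only to keep one element of it.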

----

It's not clear to me why FieldSelectorVisitor was eliminated in LUCENE-2621 (no 
discussion in the comments on point) but it's also not clear to me why 
LazyDocument+LazyField would ever be a good idea in any application that had 
more than a handful of fields (and if you don't have very many fields, why are 
you lazy loading?).  

It's also not clear to me why the LazyDocument version of LazyField doesn't 
include the same caching logic as the version that was included in 
FieldSelectorVisitor (or the older lazy loading code in 3.6) because w/o that 
the usage pattern in Solr -- in which Document objects are cached -- results in 
the worst of all possible worlds: once a Document is cached with only a small 
subset of "real" fields, and the rest are "LazyField" instances, every 
subsequent request for that document that involves those LazyFields is slow, 
even if they ask for the same LazyField over and over.
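
For illustration only, here is one way the missing caching could look in a 
simplified model.  This is not the fix that went into Solr or Lucene, and 
every name in it ({{StoredDoc}}, {{CachingLazyDoc}}, {{CachingLazyField}}, 
{{CachingDemo}}) is hypothetical.  The idea is that the lazy document 
remembers the values array per field name, so N lazy fields of the same name 
share one scan, and later re-use of a cached document costs no further scans.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model; NOT Lucene code and NOT the actual fix.
final class StoredDoc {
    private final List<String[]> fields = new ArrayList<>(); // {name, value}
    long entriesVisited = 0;

    void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    String[] getValues(String name) {
        List<String> out = new ArrayList<>();
        for (String[] f : fields) {
            entriesVisited++;
            if (f[0].equals(name)) {
                out.add(f[1]);
            }
        }
        return out.toArray(new String[0]);
    }
}

final class CachingLazyDoc {
    private final StoredDoc doc;
    // Values per field name, filled on first request. A plain HashMap is not
    // thread safe; anything shared between request threads would need more care.
    private final Map<String, String[]> valuesByName = new HashMap<>();

    CachingLazyDoc(StoredDoc doc) {
        this.doc = doc;
    }

    String[] valuesFor(String name) {
        return valuesByName.computeIfAbsent(name, doc::getValues);
    }
}

final class CachingLazyField {
    private final CachingLazyDoc parent;
    private final String name;
    private final int num;

    CachingLazyField(CachingLazyDoc parent, String name, int num) {
        this.parent = parent;
        this.name = name;
        this.num = num;
    }

    String stringValue() {
        return parent.valuesFor(name)[num];
    }
}

final class CachingDemo {
    // Entries visited when reading n lazy values of "foo", `passes` times over.
    static long visitsFor(int n, int passes) {
        StoredDoc doc = new StoredDoc();
        doc.add("id", "1");
        for (int i = 0; i < n; i++) {
            doc.add("foo", "v" + i);
        }
        CachingLazyDoc lazy = new CachingLazyDoc(doc);
        List<CachingLazyField> outer = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            outer.add(new CachingLazyField(lazy, "foo", i));
        }
        for (int p = 0; p < passes; p++) {
            for (CachingLazyField f : outer) {
                f.stringValue();
            }
        }
        return doc.entriesVisited;
    }

    public static void main(String[] args) {
        System.out.println("one pass:    " + visitsFor(1000, 1) + " entries visited");
        System.out.println("three passes: " + visitsFor(1000, 3) + " entries visited");
    }
}
```

In this model the total stays at N + 1 entry visits no matter how many times 
the same lazy fields are read, at the price of holding the values array in 
memory for as long as the document is cached.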

                
> 4.x + enableLazyFieldLoading + large multivalued fields + varying fl = 
> pathological CPU load & response time
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-4589
>                 URL: https://issues.apache.org/jira/browse/SOLR-4589
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.0, 4.1, 4.2
>            Reporter: Hoss Man
>         Attachments: test-just-queries.out__4.0.0_mmap_lazy_using36index.txt, 
> test-just-queries.sh, test.out__3.6.1_mmap_lazy.txt, 
> test.out__3.6.1_mmap_nolazy.txt, test.out__3.6.1_nio_lazy.txt, 
> test.out__3.6.1_nio_nolazy.txt, test.out__4.0.0_mmap_lazy.txt, 
> test.out__4.0.0_mmap_nolazy.txt, test.out__4.0.0_nio_lazy.txt, 
> test.out__4.0.0_nio_nolazy.txt, test.out__4.2.0_mmap_lazy.txt, 
> test.out__4.2.0_mmap_nolazy.txt, test.out__4.2.0_nio_lazy.txt, 
> test.out__4.2.0_nio_nolazy.txt, test.sh
>
>
> Following up on a [user report of extreme CPU usage in 
> 4.1|http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201302.mbox/%[email protected]%3E],
>  I've discovered that the following combination of factors can result in 
> extreme CPU usage and excessive HTTP response times...
> * Solr 4.x (tested 3.6.1, 4.0.0, and 4.2.0)
> * enableLazyFieldLoading == true (included in example solrconfig.xml)
> * documents with a large number of values in multivalued fields (eg: tested 
> ~10-15K values)
> * multiple requests returning the same doc with different "fl" lists
> I haven't dug into the root cause yet, but the essential observation is: if 
> lazy loading is used in 4.x, then once a document has been fetched with an 
> initial fl list X, subsequent requests for that document using a different fl 
> list Y can be many orders of magnitude slower (while pegging the CPU) -- even 
> if those same requests using fl Y uncached (or w/o lazy loading) would be 
> extremely fast.

