[
https://issues.apache.org/jira/browse/SOLR-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603845#comment-13603845
]
Hoss Man commented on SOLR-4589:
--------------------------------
bq. Lucene natively no longer has support for lazy field loading, but there is
a "backwards layer" just for Solr in modules/misc (LazyDocument.java)
Yeah .. LazyDocument is seriously evil...
bq. The document does not use maps to lookup, if you have many fields its
always a scan through the ArrayList of all fields in the document.
It's worse than that though -- having many fields and scanning through them all
to fetch a single field value (or an array of field values for a single name)
is a cost that has to be paid in 4.x regardless of whether you are using
LazyDocument or not. The root problem here seems to be having fields that
contain many _values_. That problem is exacerbated by the fact that, unlike 3.x
lazy loading, 4.x LazyDocument/LazyField doesn't do anything to "cache" the
fields you've already asked for.
Below are my notes from investigating this and trying to get up to speed on the
new world order of document loading w/o FieldSelector. I'll experiment with
some fixes after I get some food...
----
LUCENE-2308 - r1162347
* IndexReader.doc(int,FieldSelector) deleted
* FieldSelector moved to misc
* new concept StoredFieldVisitor introduced
** void IndexReader.document(int docID, StoredFieldVisitor visitor)
* new impl DocumentStoredFieldVisitor extends StoredFieldVisitor
* new impl FieldSelectorVisitor extends StoredFieldVisitor
** appears all the old FieldSelector logic from IndexReader moved here?
** contains a private "LazyField extends Field" that caches field values once
fetched
* SolrIndexSearcher modified to use FieldSelectorVisitor
LUCENE-2621 - r1199779
* eliminates FieldSelector & FieldSelectorVisitor
* leaves StoredFieldVisitor & DocumentStoredFieldVisitor intact
* introduced public LazyDocument containing "LazyField implements
IndexableField"
** this version of LazyField does _not_ cache any data once fetched
* changes SolrIndexSearcher's SetNonLazyFieldSelector to extend
StoredFieldVisitor
** adds LazyField to the Document for any fields not immediately needed
The crux of the problem is that:
* LazyDocument is lazy about loading the doc, but once you ask for the value
of any LazyField, the entire Document (with all underlying IndexableField
values) is loaded.
* even though the entire document has been loaded once a single LazyField is
used, the performance of iterating over LazyFields is *TERRIBLE* when there
are lots of values for a single field
* requests for the value of individual LazyFields are not cached/stored
anywhere, so the poor performance affects all subsequent re-uses of the same
LazyDocuments
Details...
The state tracked in a LazyField is a reference back to the underlying
LazyDocument, the field name, and the "num" offset of this IndexableField in
the list of values for that field name. When you ask the LazyField for its
value, it asks the underlying LazyDocument to fetch the entire Document (if it
hasn't already), then it asks that Document for _all_ values of the
associated field name as an array, and then it looks up its "num" offset in
that array.
So if you build up an (outer) Document containing N LazyField instances for a
field named "foo" (as is done in Solr's SetNonLazyFieldSelector), and then try
to iterate over the values with something like {{String[] values =
outerDoc.getValues("foo");}} under the covers LazyField will load every value
of every field of that document into memory as an "innerDoc", that innerDoc
will be asked N times to generate a new IndexableField[] of every value of
field "foo" (which BTW: involves iterating over every IndexableField value of
every field) and N-1 elements of that array will then be ignored and thrown
away.
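The access pattern described above can be illustrated with a toy model. This is only a sketch: InnerDoc, UncachedLazyField and scansToReadAll are hypothetical stand-ins invented for illustration, not the real Lucene or Solr classes. It assumes a document whose only field is "foo" with N values, and counts how many stored fields get examined while reading every value once through N uncached lazy fields.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical stand-ins, not Lucene's.
class LazyScanSketch {

    // Stand-in for the fully loaded "innerDoc": a flat list of {name, value}
    // pairs with no map from field name to values.
    static class InnerDoc {
        final List<String[]> fields = new ArrayList<>();
        long fieldsScanned = 0; // every field examined during any lookup

        void add(String name, String value) {
            fields.add(new String[] {name, value});
        }

        // Scans every field of the document and builds a fresh array per call.
        String[] getValues(String name) {
            List<String> out = new ArrayList<>();
            for (String[] f : fields) {
                fieldsScanned++;
                if (f[0].equals(name)) {
                    out.add(f[1]);
                }
            }
            return out.toArray(new String[0]);
        }
    }

    // Stand-in for an uncaching lazy field: it keeps only (doc, name, num).
    static class UncachedLazyField {
        final InnerDoc doc;
        final String name;
        final int num;

        UncachedLazyField(InnerDoc doc, String name, int num) {
            this.doc = doc;
            this.name = name;
            this.num = num;
        }

        // Fetches all values for the name, keeps one, discards the rest.
        String stringValue() {
            return doc.getValues(name)[num];
        }
    }

    // Builds a doc with n values of "foo", reads each value once through its
    // own lazy field, and returns how many fields were examined in total.
    static long scansToReadAll(int n) {
        InnerDoc doc = new InnerDoc();
        for (int i = 0; i < n; i++) {
            doc.add("foo", "v" + i);
        }
        List<UncachedLazyField> lazyFields = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            lazyFields.add(new UncachedLazyField(doc, "foo", i));
        }
        for (UncachedLazyField f : lazyFields) {
            f.stringValue();
        }
        return doc.fieldsScanned;
    }

    public static void main(String[] args) {
        System.out.println(scansToReadAll(100));  // prints 10000
        System.out.println(scansToReadAll(1000)); // prints 1000000
    }
}
```

In this model the work grows with the square of the number of values, which is at least consistent with the CPU pegging seen on documents with ~10-15K values. Any other fields in the document would add to each scan.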
----
It's not clear to me why FieldSelectorVisitor was eliminated in LUCENE-2621 (no
discussion in the comments on point) but it's also not clear to me why
LazyDocument+LazyField would ever be a good idea in any application that had
more than a handful of fields (and if you don't have very many fields, why are
you lazy loading?).
It's also not clear to me why the LazyDocument version of LazyField doesn't
include the same caching logic as the version that was included in
FieldSelectorVisitor (or the older lazy loading code in 3.6) because w/o that
the usage pattern in Solr -- in which Document objects are cached -- results in
the worst of all possible worlds: once a Document is cached with only a small
subset of "real" fields, and the rest are "LazyField" instances, every
subsequent request for that document that involves those LazyFields is slow,
even if they ask for the same LazyField over and over.
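To show why the missing caching matters for cached Documents, here is a minimal sketch of a lazy field that remembers its value once fetched. It is not the FieldSelectorVisitor or 3.6 code and not a proposed patch: InnerDoc, CachingLazyField and scansForTwoPasses are hypothetical names, and the model assumes a document containing only N values of a field "foo". It reads all values twice and reports the cumulative number of fields examined after each pass.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical stand-ins, not the real Lucene/Solr classes.
class CachedLazyFieldSketch {

    // Flat list of {name, value} pairs; lookups scan the whole list.
    static class InnerDoc {
        final List<String[]> fields = new ArrayList<>();
        long fieldsScanned = 0;

        String[] getValues(String name) {
            List<String> out = new ArrayList<>();
            for (String[] f : fields) {
                fieldsScanned++;
                if (f[0].equals(name)) {
                    out.add(f[1]);
                }
            }
            return out.toArray(new String[0]);
        }
    }

    // Lazy field that memoizes its own value after the first fetch.
    static class CachingLazyField {
        private final InnerDoc doc;
        private final String name;
        private final int num;
        private String cached; // null until the value has been fetched once

        CachingLazyField(InnerDoc doc, String name, int num) {
            this.doc = doc;
            this.name = name;
            this.num = num;
        }

        String stringValue() {
            if (cached == null) {
                cached = doc.getValues(name)[num];
            }
            return cached;
        }
    }

    // Reads all n values twice; returns the cumulative scan count after the
    // first pass and after the second pass.
    static long[] scansForTwoPasses(int n) {
        InnerDoc doc = new InnerDoc();
        List<CachingLazyField> lazyFields = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            doc.fields.add(new String[] {"foo", "v" + i});
            lazyFields.add(new CachingLazyField(doc, "foo", i));
        }
        for (CachingLazyField f : lazyFields) {
            f.stringValue();
        }
        long afterFirst = doc.fieldsScanned;
        for (CachingLazyField f : lazyFields) {
            f.stringValue();
        }
        return new long[] {afterFirst, doc.fieldsScanned};
    }

    public static void main(String[] args) {
        long[] scans = scansForTwoPasses(50);
        System.out.println(scans[0]); // prints 2500
        System.out.println(scans[1]); // prints 2500
    }
}
```

Per-field caching like this only makes repeat requests against a cached Document cheap. The first pass is still quadratic in this model, so it would not by itself address the cost of the initial iteration.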
> 4.x + enableLazyFieldLoading + large multivalued fields + varying fl =
> pathological CPU load & response time
> ------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-4589
> URL: https://issues.apache.org/jira/browse/SOLR-4589
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.0, 4.1, 4.2
> Reporter: Hoss Man
> Attachments: test-just-queries.out__4.0.0_mmap_lazy_using36index.txt,
> test-just-queries.sh, test.out__3.6.1_mmap_lazy.txt,
> test.out__3.6.1_mmap_nolazy.txt, test.out__3.6.1_nio_lazy.txt,
> test.out__3.6.1_nio_nolazy.txt, test.out__4.0.0_mmap_lazy.txt,
> test.out__4.0.0_mmap_nolazy.txt, test.out__4.0.0_nio_lazy.txt,
> test.out__4.0.0_nio_nolazy.txt, test.out__4.2.0_mmap_lazy.txt,
> test.out__4.2.0_mmap_nolazy.txt, test.out__4.2.0_nio_lazy.txt,
> test.out__4.2.0_nio_nolazy.txt, test.sh
>
>
> Following up on a [user report of extreme CPU usage in
> 4.1|http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201302.mbox/%[email protected]%3E],
> I've discovered that the following combination of factors can result in
> extreme CPU usage and excessive HTTP response times...
> * Solr 4.x (tested 3.6.1, 4.0.0, and 4.2.0)
> * enableLazyFieldLoading == true (included in example solrconfig.xml)
> * documents with a large number of values in multivalued fields (eg: tested
> ~10-15K values)
> * multiple requests returning the same doc with different "fl" lists
> I haven't dug into the root cause yet, but the essential observation is: if
> lazy loading is used in 4.x, then once a document has been fetched with an
> initial fl list X, subsequent requests for that document using a different fl
> list Y can be many orders of magnitude slower (while pegging the CPU) -- even
> if those same requests using fl Y uncached (or w/o lazy loading) would be
> extremely fast.