[
https://issues.apache.org/jira/browse/SOLR-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990572#comment-16990572
]
Yonik Seeley commented on SOLR-14013:
-------------------------------------
OK, found the primary issue.... it's an N^2 bug.
h3. Background:
SOLR-12885 changed SolrDocument (among other things) to make it so
that when used through certain interfaces, gets would auto-check-and-convert
keys & values of type Utf8CharSequence to String. This happened in methods
getFieldValueMap.get(), getFieldValueMap.keySet(), remove() and getValues()
h3. Issues:
SolrDocument does not auto-convert through many other of its methods,
such as .get(), .put(), .keySet(), so depending on this anywhere is extremely
fragile and will break if you change how you access SolrDocument
SolrInputField has some of the same issues as SolrDocument, the mere act of
doing a .get() on a multi-valued field (which should be O(1)) scans the entire
list for CharSequence and if it finds one, creates a new list and iterates over
the whole thing again to convert each element.
h3. Client side indexing:
And it's worse, because it looks like this auto-check-and-convert logic is even
triggered when the SolrJ is using JavaBinCodec to send documents... so even if
some field values were Utf8CharSequence to begin with, they would still be
converted to
String before being converted back to utf8 by JavaBinCodec!
h3. Server side indexing:
Then on the server side, JavaBinCodec parses String values as
Utf8CharSequence, and
we start going through the update processor chain.
FieldMutatingUpdateProcessor (used
in our _default config to remove blank values) asks each SolrInputField for its
value, which again triggers iteration over the complete list. Also, for *any*
string values (single valued too), FieldMutatingUpdateProcessor replaces those
Utf8CharSequence objects with String objects (destroying any attempted
re-serializing
optimization)
Then comes NestedUpdateProcessorFactory, which triggers the
auto-check-and-convert *twice*, because getValue() returned a pointer
previously, which would have been optimized away. Both lines below iterate
over all values, *before* the actual iteration by the explicit "for" loop:
{code:java}
boolean isSingleVal = !(field.getValue() instanceof Collection);
for(Objectval: field) {
{code}
then isAtomicUpdate(), and then finally writeSolrInputDocument() to convert to
JavaBin for the transaction log both trigger the extra
iterate-over-all-values with each inspection. If FieldMutatingUpdateProcessor
hadn't overwritten Utf8CharSequence already, all of these accesses would have
also triggered a new collection creation each time (and an additional
iteration to
create the new collection) for every multi-valued string field.
h3. Server side query:
On the query side, we get a Lucene Document, and then convert it into a
SolrDocument. Binary ResponseWriter uses MaskCharSeqSolrDocument which
inherits from SolrDocument to do the auto-convert-on-access stuff more
thoroughly.
{code:java}
for (IndexableField f : doc.getFields()) {
final String fname = f.name();
if (null == fieldNamesNeeded || fieldNamesNeeded.contains(fname) ) {
// Make sure multivalued fields are represented as lists
Object existing = out.get(fname);
{code}
For multi-valued fields, what we get back from lucene is actually a flat list
of all the values in the whole document. We need to collect all values
with the same field into a list. So if there are 1000 values in a
field, the outer loop executes over 1000 times. Then in the inner loop we
retrieve any existing value for the field by calling "out.get(fname)", which
triggers the auto-convert-on-access which scans all the values so far (on
average 1000/2),
and hence we have our O(N^2) behavior that the original poster reported.
h3. Other:
It took a really long time to review some of this code (and I've only reviewed
some), often because a lack of comments around non-obvious things. I thought
there might be lifetime/sharing bugs with BytesBlock for example, until I
realized that strings are appended in the block rather than placed at the
start.
Same issue for FastInputStream.readDirectUtf8... since it looked like it
was sharing the internal buffer, I thought there was a possible lifetime issue
there. A single line comment in both of those cases would have saved me quite
a bit.
Actually... looking at it again, there still may be a subtle sharing bug in
this new FastInputStream.readDirectUtf8. I can't say I quite understand the
logic behind for when you can't share the internal buffer.
{code:java}
if (in !=null || end < pos + len) return false;
{code}
You can only share the buffer when the bytes you want are right up at the end
of the buffer?
I'm not sure I understand the logic around that, but ChannelFastInputStream
(used by
TransactionLog) derives from FastInputStream and doesn't have an input stream,
so if it got unlucky and did a string read at the end of a buffer, there would
likely be data corruption.
> javabin performance regressions
> -------------------------------
>
> Key: SOLR-14013
> URL: https://issues.apache.org/jira/browse/SOLR-14013
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 7.7
> Reporter: Yonik Seeley
> Assignee: Yonik Seeley
> Priority: Major
> Attachments: test.json
>
>
> As noted by [~rrockenbaugh] in SOLR-13963, javabin also recently became
> orders of magnitude slower in certain cases since v7.7. The cases identified
> so far include large numbers of values in a field.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]