[ https://issues.apache.org/jira/browse/SOLR-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264612#comment-14264612 ]
Steve Rowe commented on SOLR-6666: ---------------------------------- +1 for Erick's patch. Separately, I've long thought that iterating through each dynamic copy field looking for matches was non-optimal - I think this could be converted into one or more regular expressions, like maybe one for prefix globs and one for suffix globs. {quote} We have three other test failures, it's always best to run 'ant test' before putting up a patch. That said, I think the one I'm seeing (there are three, but they're all the same problem) is the following: Steve Rowe I'm particularly interested in what you think here. The trunk code returns this fragment in TestCopyFieldCollectionResource.java {code:javascript}{ "source":"src_sub_no_ast_i", "sourceDynamicBase":"*_i", "dest":"title"}{code} , whereas the patched code returns: {code:javascript}{ "source":"src_sub_no_ast_i", "dest":"title"}{code} , The schema.xml file (if I've got the right one) has this line: {code:xml}<copyField source="src_sub_no_ast_i" dest="title"/>{code} {quote} Yeah the {{sourceDynamicBase}} is determined at schema parsing time and stored in a {{DynamicCopy}} instance. While the optimization included in the patch removes this capability -- {{DynamicCopy}} is not used in {{copyFieldsMap}} -- the idea of a single source "base" is wrongheaded for at least two reasons: a) copyField source globs can match multiple dynamicField-s (as well as non-dynamic fields for that matter); and b) it's not always possible to identify the source field(s) at schema parsing time - see e.g. SOLR-4650. Anyway, there is no real loss here in dropping support for {{sourceDynamicBase}}. bq. Like I said, the original code hurt my head, I suspect it was just wrong. Steve, do you have any comments here? Or am I mis-interpreting things? The code isn't changed much, really? Can you say why you think it was (is?) wrong? The code is trying to handle lots of cases - see [the table on SOLR-4567|https://issues.apache.org/jira/browse/SOLR-4567?focusedCommentId=13600847&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13600847]. That said, I think the code for handling dynamic copyField sources is more complicated than it needs to be: except for cases 14-16 in the above-linked table (copyField-s with a no-glob-dynamic-field-reference source and a dynamic field destination), it's not necessary to verify at schema parsing time that any (dynamic) field(s) are matched by copy field glob sources - see SOLR-4650 for a concrete example. I think {{sourceDynamicBase}} should go away entirely. > Dynamic copy fields are considering all dynamic fields, causing a significant > performance impact on indexing documents > ---------------------------------------------------------------------------------------------------------------------- > > Key: SOLR-6666 > URL: https://issues.apache.org/jira/browse/SOLR-6666 > Project: Solr > Issue Type: Improvement > Components: Schema and Analysis, update > Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 > specific CopyFields for dynamic fields, but without wildcards (the fields are > dynamic, the copy directive is not) > Reporter: Liram Vardi > Assignee: Erick Erickson > Attachments: SOLR-6666.patch, SOLR-6666.patch, SOLR-6666.patch > > > Result: > After applying a fix for this issue, tests which we conducted show more than > 40 percent improvement on our insertion performance. > Explanation: > Using JVM profiler, we found a CPU "bottleneck" during Solr indexing process. > This bottleneck can be found at org.apache.solr.schema.IndexSchema, in the > following method, "getCopyFieldsList()": > {code:title=getCopyFieldsList() |borderStyle=solid} > final List<CopyField> result = new ArrayList<>(); > for (DynamicCopy dynamicCopy : dynamicCopyFields) { > if (dynamicCopy.matches(sourceField)) { > result.add(new CopyField(getField(sourceField), > dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars)); > } > } > List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField); > if (null != fixedCopyFields) { > result.addAll(fixedCopyFields); > } > {code} > This function tries to find for an input source field all its copyFields (All > its destinations which Solr need to move this field). > As you can probably note, the first part of the procedure is the procedure > most “expensive” step (takes O( n ) time while N is the size of the > "dynamicCopyFields" group). > The next part is just a simple "hash" extraction, which takes O(1) time. > Our schema contains over then 500 copyFields but only 70 of then are > "indexed" fields. > We also have one dynamic field with a wildcard ( * ), which "catches" the > rest of the document fields. > As you can conclude, we have more than 400 copyFields that are based on this > dynamicField but all, except one, are fixed (i.e. does not contain any > wildcard). > From some reason, the copyFields registration procedure defines those 400 > fields as "DynamicCopyField " and then store them in the “dynamicCopyFields” > array, > This step makes getCopyFieldsList() very expensive (in CPU terms) without any > justification: All of those 400 copyFields are not glob and therefore do not > need any complex pattern matching to the input field. They all can be store > at the "fixedCopyFields". > Only copyFields with asterisks need this "special" treatment and they are > (especially on our case) pretty rare. > Therefore, we created a patch which fix this problem by changing the > registerCopyField() procedure. > Test which we conducted show that there is no change in the Indexing results. > Moreover, the fix still successfully passes the class unit tests (i.e. > IndexSchemaTest.java). > -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org