[jira] [Commented] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

Steve Rowe (JIRA) Mon, 05 Jan 2015 06:22:12 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264612#comment-14264612
 ]


Steve Rowe commented on SOLR-6666:
----------------------------------

+1 for Erick's patch.

Separately, I've long thought that iterating through each dynamic copy field 
looking for matches was non-optimal - I think this could be converted into one 
or more regular expressions, like maybe one for prefix globs and one for suffix 
globs.

{quote}
We have three other test failures, it's always best to run 'ant test' before 
putting up a patch. That said, I think the one I'm seeing (there are three, but 
they're all the same problem) is the following:

Steve Rowe I'm particularly interested in what you think here.

The trunk code returns this fragment in TestCopyFieldCollectionResource.java
{code:javascript}{ "source":"src_sub_no_ast_i", "sourceDynamicBase":"*_i", 
"dest":"title"}{code}
,
whereas the patched code returns:
{code:javascript}{ "source":"src_sub_no_ast_i", "dest":"title"}{code}
,
The schema.xml file (if I've got the right one) has this line:
{code:xml}<copyField source="src_sub_no_ast_i" dest="title"/>{code}
{quote}

Yeah the {{sourceDynamicBase}} is determined at schema parsing time and stored 
in a {{DynamicCopy}} instance.  While the optimization included in the patch 
removes this capability -- {{DynamicCopy}} is not used in {{copyFieldsMap}} -- 
the idea of a single source "base" is wrongheaded for at least two reasons: a) 
copyField source globs can match multiple dynamicField-s (as well as 
non-dynamic fields for that matter); and b) it's not always possible to 
identify the source field(s) at schema parsing time - see e.g. SOLR-4650.  
Anyway, there is no real loss here in dropping support for 
{{sourceDynamicBase}}.

bq. Like I said, the original code hurt my head, I suspect it was just wrong. 
Steve, do you have any comments here? Or am I mis-interpreting things?

The code isn't changed much, really?  Can you say why you think it was (is?) 
wrong?  The code is trying to handle lots of cases - see [the table on 
SOLR-4567|https://issues.apache.org/jira/browse/SOLR-4567?focusedCommentId=13600847&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13600847].
  That said, I think the code for handling dynamic copyField sources is more 
complicated than it needs to be: except for cases 14-16 in the above-linked 
table (copyField-s with a no-glob-dynamic-field-reference source and a dynamic 
field destination), it's not necessary to verify at schema parsing time that 
any (dynamic) field(s) are matched by copy field glob sources - see SOLR-4650 
for a concrete example.  I think {{sourceDynamicBase}} should go away entirely.


> Dynamic copy fields are considering all dynamic fields, causing a significant 
> performance impact on indexing documents
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6666
>                 URL: https://issues.apache.org/jira/browse/SOLR-6666
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis, update
>         Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 
> specific CopyFields for dynamic fields, but without wildcards (the fields are 
> dynamic, the copy directive is not)
>            Reporter: Liram Vardi
>            Assignee: Erick Erickson
>         Attachments: SOLR-6666.patch, SOLR-6666.patch, SOLR-6666.patch
>
>
> Result:
> After applying a fix for this issue, tests which we conducted show more than 
> 40 percent improvement on our insertion performance.
> Explanation:
> Using JVM profiler, we found a CPU "bottleneck" during Solr indexing process. 
> This bottleneck can be found at org.apache.solr.schema.IndexSchema, in the 
> following method, "getCopyFieldsList()":
> {code:title=getCopyFieldsList() |borderStyle=solid}
> final List<CopyField> result = new ArrayList<>();
>     for (DynamicCopy dynamicCopy : dynamicCopyFields) {
>       if (dynamicCopy.matches(sourceField)) {
>         result.add(new CopyField(getField(sourceField), 
> dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
>       }
>     }
>     List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
>     if (null != fixedCopyFields) {
>       result.addAll(fixedCopyFields);
>     }
> {code}
> This function tries to find for an input source field all its copyFields (All 
> its destinations which Solr need to move this field). 
> As you can probably note, the first part of the procedure is the procedure 
> most “expensive” step (takes O( n ) time while N is the size of the 
> "dynamicCopyFields" group).
> The next part is just a simple "hash" extraction, which takes O(1) time. 
> Our schema contains over then 500 copyFields but only 70 of then are 
> "indexed" fields. 
> We also have one dynamic field with  a wildcard ( * ), which "catches" the 
> rest of the document fields. 
> As you can conclude, we have more than 400 copyFields that are based on this 
> dynamicField but all, except one, are fixed (i.e. does not contain any 
> wildcard).
> From some reason, the copyFields registration procedure defines those 400 
> fields as "DynamicCopyField " and then store them in the “dynamicCopyFields” 
> array, 
> This step makes getCopyFieldsList() very expensive (in CPU terms) without any 
> justification: All of those 400 copyFields are not glob and therefore do not 
> need any complex pattern matching to the input field. They all can be store 
> at the "fixedCopyFields".
> Only copyFields with asterisks need this "special" treatment and they are 
> (especially on our case) pretty rare.  
> Therefore, we created a patch which fix this problem by changing the 
> registerCopyField() procedure.
> Test which we conducted show that there is no change in the Indexing results. 
> Moreover, the fix still successfully passes the class unit tests (i.e. 
> IndexSchemaTest.java).
>        



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

Reply via email to