[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
[ https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-:
- Attachment: SOLR-.patch

Same patch with CHANGES.txt added.

Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
--

Key: SOLR-
URL: https://issues.apache.org/jira/browse/SOLR-
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis, update
Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 specific CopyFields for dynamic fields, but without wildcards (the fields are dynamic, the copy directive is not)
Reporter: Liram Vardi
Assignee: Erick Erickson
Attachments: SOLR-.patch, SOLR-.patch, SOLR-.patch, SOLR-.patch

Result:
After applying a fix for this issue, our tests show more than a 40 percent improvement in insertion performance.

Explanation:
Using a JVM profiler, we found a CPU bottleneck in the Solr indexing process. The bottleneck is in org.apache.solr.schema.IndexSchema, in the following method, getCopyFieldsList():

{code:title=getCopyFieldsList()|borderStyle=solid}
final List<CopyField> result = new ArrayList<>();
// Part 1: test the source field against every dynamic copy rule.
for (DynamicCopy dynamicCopy : dynamicCopyFields) {
  if (dynamicCopy.matches(sourceField)) {
    result.add(new CopyField(getField(sourceField), dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
  }
}
// Part 2: look up the fixed copy rules registered under this exact name.
List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
if (null != fixedCopyFields) {
  result.addAll(fixedCopyFields);
}
{code}

This function finds, for an input source field, all of its copyFields (that is, all the destinations to which Solr needs to copy the field). The first part of the procedure is its most expensive step: it takes O(n) time, where n is the size of the dynamicCopyFields group. The next part is just a simple hash lookup, which takes O(1) time.
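To make the cost difference above concrete, here is a minimal, self-contained sketch. It does not use Solr's classes: `CopyFieldLookupSketch`, `Rule`, `build`, and the field names are hypothetical stand-ins, and the glob matching is deliberately simplified to a single leading or trailing asterisk. It only counts how many pattern checks one lookup performs when 400 wildcard-free rules sit in the scanned list versus in the hash map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the lookup in getCopyFieldsList(); not Solr's actual code. */
class CopyFieldLookupSketch {

    /** A copy rule whose source is an exact name or has one leading/trailing '*'. */
    static final class Rule {
        final String sourcePattern;
        final String dest;

        Rule(String sourcePattern, String dest) {
            this.sourcePattern = sourcePattern;
            this.dest = dest;
        }

        // Simplified glob matching (assumption: Solr's real matching is richer).
        boolean matches(String field) {
            if (sourcePattern.startsWith("*")) {
                return field.endsWith(sourcePattern.substring(1));
            }
            if (sourcePattern.endsWith("*")) {
                return field.startsWith(sourcePattern.substring(0, sourcePattern.length() - 1));
            }
            return sourcePattern.equals(field);
        }
    }

    final List<Rule> dynamicRules = new ArrayList<>();
    final Map<String, List<String>> fixedRules = new HashMap<>();
    int patternChecks = 0; // number of matches() calls made so far

    /** Mirrors the two-part shape of the method above: linear scan, then hash lookup. */
    List<String> destinationsFor(String sourceField) {
        List<String> result = new ArrayList<>();
        for (Rule rule : dynamicRules) {
            patternChecks++;
            if (rule.matches(sourceField)) {
                result.add(rule.dest);
            }
        }
        List<String> fixed = fixedRules.get(sourceField);
        if (fixed != null) {
            result.addAll(fixed);
        }
        return result;
    }

    /** Builds one wildcard rule plus 400 wildcard-free rules, stored in either structure. */
    static CopyFieldLookupSketch build(boolean explicitRulesAsDynamic) {
        CopyFieldLookupSketch schema = new CopyFieldLookupSketch();
        schema.dynamicRules.add(new Rule("*_txt", "text"));
        for (int i = 0; i < 400; i++) {
            String source = "field_" + i;
            String dest = "dest_" + i;
            if (explicitRulesAsDynamic) {
                schema.dynamicRules.add(new Rule(source, dest));
            } else {
                schema.fixedRules.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
            }
        }
        return schema;
    }

    public static void main(String[] args) {
        CopyFieldLookupSketch before = build(true);
        CopyFieldLookupSketch after = build(false);
        // prints "[dest_7] checks=401"
        System.out.println(before.destinationsFor("field_7") + " checks=" + before.patternChecks);
        // prints "[dest_7] checks=1"
        System.out.println(after.destinationsFor("field_7") + " checks=" + after.patternChecks);
    }
}
```

Both configurations return the same destination; only the number of pattern checks per lookup differs, and that cost is paid for every field of every indexed document.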
Our schema contains more than 500 copyFields, but only 70 of them are indexed fields. We also have one dynamic field with a wildcard (*), which catches the rest of the document fields. As a result, more than 400 of our copyFields are based on this dynamicField, yet all except one are fixed (i.e., they do not contain any wildcard).

For some reason, the copyField registration procedure defines those 400 fields as DynamicCopyField and then stores them in the “dynamicCopyFields” array. This makes getCopyFieldsList() very expensive (in CPU terms) without any justification: none of those 400 copyFields is a glob, so none needs any complex pattern matching against the input field. They can all be stored in fixedCopyFields. Only copyFields with asterisks need this special treatment, and they are (especially in our case) pretty rare.

Therefore, we created a patch that fixes this problem by changing the registerCopyField() procedure. Tests we conducted show no change in the indexing results. Moreover, the fix still passes the class's unit tests (i.e., IndexSchemaTest.java).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
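The registration change proposed in the description above can be sketched roughly as follows. This is not the attached patch: `CopyFieldRegistrationSketch`, `register`, and the two collections are hypothetical names, and the sketch assumes the only criterion is whether the source contains an asterisk. It ignores destination wildcards, maxChars, and schema validation, all of which the real registerCopyField() would have to handle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of classifying copy rules at registration time. */
class CopyFieldRegistrationSketch {

    // Rules whose source contains '*'; these must be pattern-matched at lookup time.
    final List<String[]> dynamicRules = new ArrayList<>();

    // Rules with an exact source name; these are found with a single hash lookup.
    final Map<String, List<String>> fixedRules = new HashMap<>();

    /** Stores the rule in the cheapest structure that can still answer lookups for it. */
    void register(String source, String dest) {
        if (source.indexOf('*') >= 0) {
            dynamicRules.add(new String[] {source, dest});
        } else {
            // Assumption: an exact source name never needs glob matching, even when
            // the field itself only exists because it matches a dynamicField.
            fixedRules.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
        }
    }

    public static void main(String[] args) {
        CopyFieldRegistrationSketch schema = new CopyFieldRegistrationSketch();
        schema.register("src_sub_no_ast_i", "title"); // exact name, goes to the map
        schema.register("*_txt", "text");             // glob, goes to the scanned list
        System.out.println("dynamic=" + schema.dynamicRules.size()
            + " fixed=" + schema.fixedRules.size());
    }
}
```

With this split, the scanned list grows only with the number of true glob rules, which the description says are rare in the reporter's schema.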
[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
[ https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-:
- Attachment: SOLR-.patch

Hmmm, I like the way you've broken things out; it makes the code easier to follow. It gave me a headache looking at the original code before either patch.

We have three other test failures; it's always best to run 'ant test' before putting up a patch. That said, I think the one I'm seeing (there are three, but they're all the same problem) is the following:

[~sar...@syr.edu] I'm particularly interested in what you think here. The trunk code returns this fragment in TestCopyFieldCollectionResource.java:

{ source:src_sub_no_ast_i, sourceDynamicBase:*_i, dest:title},

whereas the patched code returns:

{ source:src_sub_no_ast_i, dest:title},

The schema.xml file (if I've got the right one) has this line:

<copyField source="src_sub_no_ast_i" dest="title"/>

Like I said, the original code hurt my head; I suspect it was just wrong. Steve, do you have any comments here? Or am I misinterpreting things?

The attached patch fixes these three problems. I'll run the whole test suite again too.

Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
--

Key: SOLR-
URL: https://issues.apache.org/jira/browse/SOLR-
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis, update
Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 specific CopyFields for dynamic fields, but without wildcards (the fields are dynamic, the copy directive is not)
Reporter: Liram Vardi
Assignee: Erick Erickson
Attachments: SOLR-.patch, SOLR-.patch, SOLR-.patch
[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
[ https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elran Dvir updated SOLR-:
- Attachment: SOLR-.patch

Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
--

Key: SOLR-
URL: https://issues.apache.org/jira/browse/SOLR-
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis, update
Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 specific CopyFields for dynamic fields, but without wildcards (the fields are dynamic, the copy directive is not)
Reporter: Liram Vardi
Assignee: Erick Erickson
Attachments: SOLR-.patch, SOLR-.patch
[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
[ https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liram Vardi updated SOLR-:
--
Description:

Result:
After applying a fix for this issue, our tests show more than a 40 percent improvement in insertion performance.

Explanation:
Using a JVM profiler, we found a CPU bottleneck in the Solr indexing process. The bottleneck is in org.apache.solr.schema.IndexSchema, in the following method, getCopyFieldsList():

{code:title=getCopyFieldsList()|borderStyle=solid}
final List<CopyField> result = new ArrayList<>();
for (DynamicCopy dynamicCopy : dynamicCopyFields) {
  if (dynamicCopy.matches(sourceField)) {
    result.add(new CopyField(getField(sourceField), dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
  }
}
List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
if (null != fixedCopyFields) {
  result.addAll(fixedCopyFields);
}
{code}

This function finds, for an input source field, all of its copyFields (that is, all the destinations to which Solr needs to copy the field). The first part of the procedure is its most expensive step: it takes O(n) time, where n is the size of the dynamicCopyFields group. The next part is just a simple hash lookup, which takes O(1) time.

Our schema contains more than 500 copyFields, but only 70 of them are indexed fields. We also have one dynamic field with a wildcard (*), which catches the rest of the document fields. As a result, more than 400 of our copyFields are based on this dynamicField, yet all except one are fixed (i.e., they do not contain any wildcard).

For some reason, the copyField registration procedure defines those 400 fields as DynamicCopyField and then stores them in the “dynamicCopyFields” array. This makes getCopyFieldsList() very expensive (in CPU terms) without any justification: none of those 400 copyFields is a glob, so none needs any complex pattern matching against the input field. They can all be stored in fixedCopyFields. Only copyFields with asterisks need this special treatment, and they are (especially in our case) pretty rare.

Therefore, we created a patch that fixes this problem by changing the registerCopyField() procedure. Tests we conducted show no change in the indexing results. Moreover, the fix still passes the class's unit tests (i.e., IndexSchemaTest.java).
[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
[ https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liram Vardi updated SOLR-:
--
Attachment: SOLR-.patch

Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents
--

Key: SOLR-
URL: https://issues.apache.org/jira/browse/SOLR-
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis, update
Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 specific CopyFields for dynamic fields, but without wildcards (the fields are dynamic, the copy directive is not)
Reporter: Liram Vardi
Attachments: SOLR-.patch