[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

2015-01-05 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-:
-
Attachment: SOLR-.patch

Same patch with CHANGES.txt added.

 Dynamic copy fields are considering all dynamic fields, causing a significant 
 performance impact on indexing documents
 --

 Key: SOLR-
 URL: https://issues.apache.org/jira/browse/SOLR-
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis, update
 Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 
 specific CopyFields for dynamic fields, but without wildcards (the fields are 
 dynamic, the copy directive is not)
Reporter: Liram Vardi
Assignee: Erick Erickson
 Attachments: SOLR-.patch, SOLR-.patch, SOLR-.patch, 
 SOLR-.patch


 Result:
 After applying a fix for this issue, the tests we conducted show an improvement 
 of more than 40 percent in our insertion performance.
 Explanation:
 Using a JVM profiler, we found a CPU bottleneck in the Solr indexing process. 
 The bottleneck is in org.apache.solr.schema.IndexSchema, in the following 
 method, getCopyFieldsList():
 {code:title=getCopyFieldsList() |borderStyle=solid}
 final List<CopyField> result = new ArrayList<>();
 // Part 1: linear scan over every registered dynamic copy directive.
 for (DynamicCopy dynamicCopy : dynamicCopyFields) {
   if (dynamicCopy.matches(sourceField)) {
     result.add(new CopyField(getField(sourceField), 
         dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
   }
 }
 // Part 2: single hash lookup for the fixed (non-dynamic) copy directives.
 List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
 if (null != fixedCopyFields) {
   result.addAll(fixedCopyFields);
 }
 {code}
 This function finds, for an input source field, all of its copyFields (all 
 the destinations to which Solr needs to copy this field). 
 As you can see, the first part of the procedure is its most “expensive” step 
 (it takes O(n) time, where n is the size of the dynamicCopyFields group).
 The next part is just a simple hash lookup, which takes O(1) time. 
 Our schema contains more than 500 copyFields, but only 70 of them are 
 indexed fields. 
 We also have one dynamic field with a wildcard ( * ), which catches the 
 rest of the document fields. 
 As a result, we have more than 400 copyFields that are based on this 
 dynamicField, but all of them except one are fixed (i.e. they do not contain 
 any wildcard).
 For some reason, the copyFields registration procedure defines those 400 
 fields as DynamicCopyField and then stores them in the “dynamicCopyFields” 
 array. 
 This step makes getCopyFieldsList() very expensive (in CPU terms) without any 
 justification: none of those 400 copyFields is a glob, so none of them needs 
 any pattern matching against the input field. They can all be stored 
 in the fixedCopyFields.
 Only copyFields with asterisks need this special treatment, and they are 
 (especially in our case) pretty rare.  
 Therefore, we created a patch which fixes this problem by changing the 
 registerCopyField() procedure.
 The tests we conducted show that there is no change in the indexing results. 
 Moreover, the fix still passes the class's unit tests (i.e. 
 IndexSchemaTest.java).
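
The registration change described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch and not Solr's IndexSchema: the class CopyFieldRegistrySketch, its methods, and the simplified glob matcher (a single leading or trailing asterisk) are all hypothetical names chosen for this example, and destinations are plain strings rather than CopyField objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a schema's copyField registry; not Solr's IndexSchema. */
class CopyFieldRegistrySketch {
  /** Explicit (non-glob) sources: resolved with one hash lookup per source field. */
  private final Map<String, List<String>> fixedCopyFields = new HashMap<>();
  /** Glob sources such as "*_s": these have to be scanned and pattern-matched. */
  private final List<String[]> globCopyFields = new ArrayList<>();

  /** Registers source -> dest, picking the bucket by whether the source has a wildcard. */
  void registerCopyField(String source, String dest) {
    if (source.contains("*")) {
      globCopyFields.add(new String[] {source, dest});
    } else {
      fixedCopyFields.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
    }
  }

  /** Returns all destinations for sourceField: glob scan first, then the hash lookup. */
  List<String> getCopyDestinations(String sourceField) {
    List<String> result = new ArrayList<>();
    for (String[] glob : globCopyFields) {
      if (matchesGlob(glob[0], sourceField)) {
        result.add(glob[1]);
      }
    }
    List<String> fixed = fixedCopyFields.get(sourceField);
    if (fixed != null) {
      result.addAll(fixed);
    }
    return result;
  }

  /** Number of entries that still need a scan on every lookup. */
  int globCount() {
    return globCopyFields.size();
  }

  /** Simplified matcher (assumption): only one leading or one trailing '*' is handled. */
  static boolean matchesGlob(String glob, String name) {
    if (glob.startsWith("*")) return name.endsWith(glob.substring(1));
    if (glob.endsWith("*")) return name.startsWith(glob.substring(0, glob.length() - 1));
    return glob.equals(name);
  }
}
```

With this split, a schema like the one described would keep only its single wildcard directive in the scanned list, while the 400-odd explicit directives cost one map lookup each. The sketch deliberately ignores wildcard destinations and any bookkeeping about which dynamic field a source name resolves to.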




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

2014-12-30 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-:
-
Attachment: SOLR-.patch

Hmmm, I like the way you've broken things out, it makes the code easier to 
follow. It gave me a headache looking at the original code before either patch.

We have three other test failures; it's always best to run 'ant test' before 
putting up a patch. That said, I think the one I'm seeing (there are three, but 
they're all the same problem) is the following:

[~sar...@syr.edu] I'm particularly interested in what you think here.

The trunk code returns this fragment in TestCopyFieldCollectionResource.java:
{
  "source":"src_sub_no_ast_i",
  "sourceDynamicBase":"*_i",
  "dest":"title"},

whereas the patched code returns:
{
  "source":"src_sub_no_ast_i",
  "dest":"title"},

The schema.xml file (if I've got the right one) has this line:
   <copyField source="src_sub_no_ast_i" dest="title"/>

Like I said, the original code hurt my head; I suspect it was just wrong. 
Steve, do you have any comments here? Or am I misinterpreting things?

The attached patch fixes these three problems, I'll run the whole test suite 
again too.


 Dynamic copy fields are considering all dynamic fields, causing a significant 
 performance impact on indexing documents
 --

 Key: SOLR-
 URL: https://issues.apache.org/jira/browse/SOLR-
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis, update
 Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 
 specific CopyFields for dynamic fields, but without wildcards (the fields are 
 dynamic, the copy directive is not)
Reporter: Liram Vardi
Assignee: Erick Erickson
 Attachments: SOLR-.patch, SOLR-.patch, SOLR-.patch





[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

2014-12-21 Thread Elran Dvir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elran Dvir updated SOLR-:
-
Attachment: SOLR-.patch

 Dynamic copy fields are considering all dynamic fields, causing a significant 
 performance impact on indexing documents
 --

 Key: SOLR-
 URL: https://issues.apache.org/jira/browse/SOLR-
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis, update
 Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 
 specific CopyFields for dynamic fields, but without wildcards (the fields are 
 dynamic, the copy directive is not)
Reporter: Liram Vardi
Assignee: Erick Erickson
 Attachments: SOLR-.patch, SOLR-.patch





[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

2014-10-29 Thread Liram Vardi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liram Vardi updated SOLR-:
--
Description: 
Result:
After applying a fix for this issue, the tests we conducted show an improvement 
of more than 40 percent in our insertion performance.

Explanation:

Using a JVM profiler, we found a CPU bottleneck in the Solr indexing process. 
The bottleneck is in org.apache.solr.schema.IndexSchema, in the following 
method, getCopyFieldsList():

{code:title=getCopyFieldsList() |borderStyle=solid}
final List<CopyField> result = new ArrayList<>();
// Part 1: linear scan over every registered dynamic copy directive.
for (DynamicCopy dynamicCopy : dynamicCopyFields) {
  if (dynamicCopy.matches(sourceField)) {
    result.add(new CopyField(getField(sourceField), 
        dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
  }
}
// Part 2: single hash lookup for the fixed (non-dynamic) copy directives.
List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
if (null != fixedCopyFields) {
  result.addAll(fixedCopyFields);
}
{code}

This function finds, for an input source field, all of its copyFields (all 
the destinations to which Solr needs to copy this field). 
As you can see, the first part of the procedure is its most “expensive” step 
(it takes O(n) time, where n is the size of the dynamicCopyFields group).
The next part is just a simple hash lookup, which takes O(1) time. 
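
To make the cost difference concrete, here is a small counting sketch. It does not measure time and does not use Solr; LookupCostSketch and its methods are hypothetical, and the figures of 70 fields per document and 400 registered entries are assumptions borrowed from the schema sizes mentioned in this report.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical cost model: counts match attempts instead of measuring time. */
class LookupCostSketch {
  /** A full scan tests every registered entry; it cannot stop at the first hit. */
  static long scanChecks(List<String> dynamicCopySources, String sourceField) {
    long checks = 0;
    for (String source : dynamicCopySources) {
      checks++; // stands in for one matches() call on a registered entry
    }
    return checks;
  }

  /** Total match attempts for one document when every field triggers a full scan. */
  static long checksPerDocument(int fieldsInDocument, int registeredEntries) {
    List<String> sources = new ArrayList<>();
    for (int i = 0; i < registeredEntries; i++) {
      sources.add("field_" + i);
    }
    long total = 0;
    for (int f = 0; f < fieldsInDocument; f++) {
      total += scanChecks(sources, "field_" + f);
    }
    return total;
  }

  public static void main(String[] args) {
    // Assumed sizes: 70 fields per document, 400 entries in the scanned list.
    System.out.println(checksPerDocument(70, 400)); // prints 28000
  }
}
```

Under these assumptions a single document costs 28000 match attempts in the scan, against 70 hash lookups if the same entries were held in the map.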

Our schema contains more than 500 copyFields, but only 70 of them are indexed 
fields. 
We also have one dynamic field with a wildcard ( * ), which catches the rest 
of the document fields. 
As a result, we have more than 400 copyFields that are based on this 
dynamicField, but all of them except one are fixed (i.e. they do not contain 
any wildcard).

For some reason, the copyFields registration procedure defines those 400 
fields as DynamicCopyField and then stores them in the “dynamicCopyFields” 
array. 
This step makes getCopyFieldsList() very expensive (in CPU terms) without any 
justification: none of those 400 copyFields is a glob, so none of them needs 
any pattern matching against the input field. They can all be stored in 
the fixedCopyFields.
Only copyFields with asterisks need this special treatment, and they are 
(especially in our case) pretty rare.  

Therefore, we created a patch which fixes this problem by changing the 
registerCopyField() procedure.
The tests we conducted show that there is no change in the indexing results. 
Moreover, the fix still passes the class's unit tests (i.e. 
IndexSchemaTest.java).

   



 Dynamic copy 

[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

2014-10-29 Thread Liram Vardi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liram Vardi updated SOLR-:
--
Attachment: SOLR-.patch

 Dynamic copy fields are considering all dynamic fields, causing a significant 
 performance impact on indexing documents
 --

 Key: SOLR-
 URL: https://issues.apache.org/jira/browse/SOLR-
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis, update
 Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 
 specific CopyFields for dynamic fields, but without wildcards (the fields are 
 dynamic, the copy directive is not)
Reporter: Liram Vardi
 Attachments: SOLR-.patch

