[jira] [Updated] (GEODE-6973) getExistingIdForType should not compare all entries in idToType region

xiaojian zhou (JIRA) Thu, 18 Jul 2019 15:31:39 -0700


     [ 
https://issues.apache.org/jira/browse/GEODE-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


xiaojian zhou updated GEODE-6973:
---------------------------------
    Description: 
We found the PeerTypeRegistration's getExistingIdForType() will iterate through 
the idToType region's entries to find if the incoming newType is there. 

If idToType region contains 20K or 100K entries, this will impact the put 
throughput (customers did notice the performance downgrade when there're many 
pdxTypes). 

To make the things worse, the comparison is to compare the whole object, field 
to field. If the json object (which will be converted to pdxType) contains 30 
fields, the comparison will have to compare up to 30 fields. If the idToType 
region contains 20K entries, A new pdxType will do 20K  x 30 string comparisons 
before register it. 

We found each server maintained a typeToId map, this map is used to check if 
the pdxType exists. If exists, it will return the type id without check the 
IdToType region. The total number of pdxType did not impact the put performance 
if the pdxTypd exists. 

The typeToId map is maintained with a d-lock, each time we added a new pdxType, 
it will update into the map while still holding the d-lok. So we believe that 
the map should be the same as the region in content. If we cannot find the 
pdxType in the map, it should not be in the region. We can skip the iteration 
of region (which is the root cause of the performance issue). 

Another issue in current code is: when each time a new type come, it will 
recreate the map. This is unnecessary and contributes to the slowness too. 

We should only create the map during initialize(). 

Here are the tests we want to introduce:
1) a junit test to prove that reorder fields in a big JSON file will not cause 
significant hashcode conflicts (<1%)
2) a junit test to prove that add a index to a field in a big JSON file will 
hardly cause hashcode conflicts. 
This 2 tests are to prove that hashcode conflict is not the root cause of 
linear probing for PDXTypeId. 

3) a junit test to prove that for the cases that hashcode conflict caused by 
reordered fields, there will be no hashcode conflicts if using 
SORT_JSON_FIELD_NAMES_PROPERTY=true. 

4) a dunit test to prove that SORT_JSON_FIELD_NAMES_PROPERTY=true or false did 
not impact the performance to add a new pdxType. 

5) a dunit test to create a new pdxType from 2 peer server at the same time. 
The test is to prove that the d-lock take effect, one server create the 
pdxType, and another server should find the pdxType exists. 
Do this test both from server directly and from clients. 

6) Create 2 different objects which ends up with the same hashcode (we can get 
the 2 objects from test-1), try to put the 2 objects to create new pdxType. The 
2nd one should also create a new type. It should not be treated as "found an 
existing pdxType". 

  was:
We found the PeerTypeRegistration's getExistingIdForType() will iterate through 
the idToType region's entries to find if the incoming newType is there. 

If idToType region contains 20K or 100K entries, this will impact the put 
throughput (customers did notice the performance downgrade when there're many 
pdxTypes). 

To make the things worse, the comparison is to compare the whole object, field 
to field. If the json object (which will be converted to pdxType) contains 30 
fields, the comparison will have to compare up to 30 fields. If the idToType 
region contains 20K entries, A new pdxType will do 20K  x 30 string comparisons 
before register it. 

To speed up, we proposed to add an index hashmap, which used pdxType's hashcode 
as key, pdxTypeId's arrayList or hashset as value. 

When a new pdxType arrived, calculate its hashcode, then use the hashcode to 
find its pdxTypeIds (it could be a list if there're hash conflicts, but in most 
of case, it should be only one). Then use this small list to find the pdxTypes. 
Only need to do field-to-filed compare to this small list of pdxType. 

This algorithm will dramatically speed up.


> getExistingIdForType should not compare all entries in idToType region
> ----------------------------------------------------------------------
>
>                 Key: GEODE-6973
>                 URL: https://issues.apache.org/jira/browse/GEODE-6973
>             Project: Geode
>          Issue Type: Bug
>            Reporter: xiaojian zhou
>            Priority: Major
>
> We found the PeerTypeRegistration's getExistingIdForType() will iterate 
> through the idToType region's entries to find if the incoming newType is 
> there. 
> If idToType region contains 20K or 100K entries, this will impact the put 
> throughput (customers did notice the performance downgrade when there're many 
> pdxTypes). 
> To make the things worse, the comparison is to compare the whole object, 
> field to field. If the json object (which will be converted to pdxType) 
> contains 30 fields, the comparison will have to compare up to 30 fields. If 
> the idToType region contains 20K entries, A new pdxType will do 20K  x 30 
> string comparisons before register it. 
> We found each server maintained a typeToId map, this map is used to check if 
> the pdxType exists. If exists, it will return the type id without check the 
> IdToType region. The total number of pdxType did not impact the put 
> performance if the pdxTypd exists. 
> The typeToId map is maintained with a d-lock, each time we added a new 
> pdxType, it will update into the map while still holding the d-lok. So we 
> believe that the map should be the same as the region in content. If we 
> cannot find the pdxType in the map, it should not be in the region. We can 
> skip the iteration of region (which is the root cause of the performance 
> issue). 
> Another issue in current code is: when each time a new type come, it will 
> recreate the map. This is unnecessary and contributes to the slowness too. 
> We should only create the map during initialize(). 
> Here are the tests we want to introduce:
> 1) a junit test to prove that reorder fields in a big JSON file will not 
> cause significant hashcode conflicts (<1%)
> 2) a junit test to prove that add a index to a field in a big JSON file will 
> hardly cause hashcode conflicts. 
> This 2 tests are to prove that hashcode conflict is not the root cause of 
> linear probing for PDXTypeId. 
> 3) a junit test to prove that for the cases that hashcode conflict caused by 
> reordered fields, there will be no hashcode conflicts if using 
> SORT_JSON_FIELD_NAMES_PROPERTY=true. 
> 4) a dunit test to prove that SORT_JSON_FIELD_NAMES_PROPERTY=true or false 
> did not impact the performance to add a new pdxType. 
> 5) a dunit test to create a new pdxType from 2 peer server at the same time. 
> The test is to prove that the d-lock take effect, one server create the 
> pdxType, and another server should find the pdxType exists. 
> Do this test both from server directly and from clients. 
> 6) Create 2 different objects which ends up with the same hashcode (we can 
> get the 2 objects from test-1), try to put the 2 objects to create new 
> pdxType. The 2nd one should also create a new type. It should not be treated 
> as "found an existing pdxType". 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[jira] [Updated] (GEODE-6973) getExistingIdForType should not compare all entries in idToType region

Reply via email to