[GitHub] [incubator-sdap-nexus] skorper commented on a diff in pull request #275: SDAP-487 - Changes to doms_data schema to improve result fetch speed

via GitHub Tue, 12 Sep 2023 16:49:57 -0700


skorper commented on code in PR #275:
URL: https://github.com/apache/incubator-sdap-nexus/pull/275#discussion_r1323762633



##########
analysis/webservice/algorithms/doms/ResultsStorage.py:
##########
@@ -361,7 +377,7 @@ def __retrieveStats(self, id):
             }
             return stats
 
-        raise Exception("Execution not found with id '%s'" % id)
+        raise NexusProcessingException(reason=f'Execution not found with id {str(execution_id)}', code=404)

Review Comment:
   Is this wording accurate? The execution may be found, but the data may not be present.
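
A minimal sketch of one way to keep the two failure cases apart, so that each message matches what actually went wrong. Everything here is an assumption for illustration: `ProcessingError` is a hypothetical stand-in for `NexusProcessingException`, and `retrieve_stats`, `known_executions`, and `stats_by_execution` are hypothetical in-memory replacements for the Cassandra lookups in `ResultsStorage.py`.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for NexusProcessingException (reason + HTTP code)."""

    def __init__(self, reason, code=500):
        super().__init__(reason)
        self.reason = reason
        self.code = code


def retrieve_stats(execution_id, known_executions, stats_by_execution):
    """Return stats for an execution, with a distinct message per failure case.

    known_executions and stats_by_execution are hypothetical in-memory
    stand-ins for the database tables; they are not part of the PR.
    """
    # Case 1: the execution itself does not exist.
    if execution_id not in known_executions:
        raise ProcessingError(
            reason=f"Execution not found with id {execution_id}", code=404
        )

    # Case 2: the execution exists, but no stats were stored for it.
    stats = stats_by_execution.get(execution_id)
    if stats is None:
        raise ProcessingError(
            reason=f"No stats found for execution {execution_id}", code=404
        )

    return stats
```

Whether both cases should map to 404 is a design choice left open here; the point is only that the wording can differ.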



##########
tools/doms-data-tools/update_doms_data_pk.py:
##########
@@ -0,0 +1,299 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more

Review Comment:
   Can you add a docstring explaining at a high level what this should be used for? We can also mention it in the release notes.



##########
analysis/webservice/algorithms/doms/ResultsStorage.py:
##########
@@ -297,18 +297,34 @@ def __retrieveData(self, id, trim_data=False, page_num=1, page_size=1000):
         return data
 
     def __enrichPrimaryDataWithMatches(self, id, dataMap, trim_data=False):
-        cql = "SELECT * FROM doms_data where execution_id = %s and is_primary = false"
-        rows = self._session.execute(cql, (id,))
-
-        for row in rows:
-            entry = self.__rowToDataEntry(row, trim_data=trim_data)
-            if row.primary_value_id in dataMap:
-                if not "matches" in dataMap[row.primary_value_id]:
-                    dataMap[row.primary_value_id]["matches"] = []
-                dataMap[row.primary_value_id]["matches"].append(entry)
+        cql = f"SELECT * FROM doms_data where execution_id = {str(id)} and is_primary = false and primary_value_id = ?"
+        statement = self._session.prepare(cql)
+
+        primary_ids = list(dataMap.keys())
+
+        logger.info(f'Getting secondary data for {len(primary_ids)} primaries of {str(id)}')
+
+        for (success, rows) in execute_concurrent_with_args(
+            self._session, statement, [(i,) for i in primary_ids], concurrency=50, results_generator=True
+        ):
+            for row in rows:
+                entry = self.__rowToDataEntry(row, trim_data=trim_data)
+                if row.primary_value_id in dataMap:
+                    if not "matches" in dataMap[row.primary_value_id]:
+                        dataMap[row.primary_value_id]["matches"] = []
+                    dataMap[row.primary_value_id]["matches"].append(entry)
+
+        # rows = self._session.execute(cql, (id,))

Review Comment:
   Can you remove this commented code? There is one other place in this PR that also contains commented code.



##########
analysis/webservice/algorithms/doms/ResultsStorage.py:
##########
@@ -192,7 +192,9 @@ def __insertResults(self, execution_id, results):
         inserts = []
 
         for result in results:
-            inserts.extend(self.__prepare_result(execution_id, None, result, insertStatement))
+            # 'PRIMARY' arg since primary values cannot have primary_value_id be null anymore
+            # Secondary matches are prepped recursively from this call
+            inserts.extend(self.__prepare_result(execution_id, 'PRIMARY', result, insertStatement))

Review Comment:
   Can we change `'PRIMARY'` to None and check against that instead?
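
A rough sketch of what this suggestion could look like: callers keep passing `None` for primaries, and the placeholder string is substituted in one place just before the row is built. The names `PRIMARY_SENTINEL` and `prepare_result`, the row tuple layout, and the `{"id": ..., "matches": [...]}` result shape are assumptions made for illustration, not the actual `__prepare_result` implementation.

```python
# Placeholder stored in primary_value_id for primary rows, since the column
# can no longer be null. The value 'PRIMARY' comes from the diff; the
# constant name is hypothetical.
PRIMARY_SENTINEL = "PRIMARY"


def prepare_result(execution_id, primary_id, result):
    """Flatten one result and its nested matches into row tuples.

    primary_id is None for a primary value and the parent's id for a
    secondary match. The result dict shape is a hypothetical simplification.
    """
    # Callers signal "this is a primary" with None, not with the sentinel.
    is_primary = primary_id is None

    # The sentinel is only introduced at the point where the row is built.
    stored_primary_id = PRIMARY_SENTINEL if is_primary else primary_id

    rows = [(execution_id, result["id"], stored_primary_id, is_primary)]

    # Secondary matches are prepared recursively with the parent's id.
    for match in result.get("matches", []):
        rows.extend(prepare_result(execution_id, result["id"], match))

    return rows
```

With this shape, the call site in `__insertResults` would stay as it was before the change (passing `None`), and the sentinel would not leak into calling code.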



##########
analysis/webservice/algorithms/doms/ResultsStorage.py:
##########
@@ -297,18 +297,34 @@ def __retrieveData(self, id, trim_data=False, page_num=1, page_size=1000):
         return data
 
     def __enrichPrimaryDataWithMatches(self, id, dataMap, trim_data=False):
-        cql = "SELECT * FROM doms_data where execution_id = %s and is_primary = false"
-        rows = self._session.execute(cql, (id,))
-
-        for row in rows:
-            entry = self.__rowToDataEntry(row, trim_data=trim_data)
-            if row.primary_value_id in dataMap:
-                if not "matches" in dataMap[row.primary_value_id]:
-                    dataMap[row.primary_value_id]["matches"] = []
-                dataMap[row.primary_value_id]["matches"].append(entry)
+        cql = f"SELECT * FROM doms_data where execution_id = {str(id)} and is_primary = false and primary_value_id = ?"
+        statement = self._session.prepare(cql)
+
+        primary_ids = list(dataMap.keys())
+
+        logger.info(f'Getting secondary data for {len(primary_ids)} primaries of {str(id)}')
+
+        for (success, rows) in execute_concurrent_with_args(
+            self._session, statement, [(i,) for i in primary_ids], concurrency=50, results_generator=True
+        ):
+            for row in rows:
+                entry = self.__rowToDataEntry(row, trim_data=trim_data)
+                if row.primary_value_id in dataMap:
+                    if not "matches" in dataMap[row.primary_value_id]:
+                        dataMap[row.primary_value_id]["matches"] = []
+                    dataMap[row.primary_value_id]["matches"].append(entry)
+
+        # rows = self._session.execute(cql, (id,))
+        #
+        # for row in rows:
+        #     entry = self.__rowToDataEntry(row, trim_data=trim_data)
+        #     if row.primary_value_id in dataMap:
+        #         if not "matches" in dataMap[row.primary_value_id]:
+        #             dataMap[row.primary_value_id]["matches"] = []
+        #         dataMap[row.primary_value_id]["matches"].append(entry)
 
     def __retrievePrimaryData(self, id, trim_data=False, page_num=2, page_size=10):
-        cql = "SELECT * FROM doms_data where execution_id = %s and is_primary = true limit %s"
+        cql = "SELECT * FROM doms_data_temp where execution_id = %s and is_primary = true limit %s"

Review Comment:
   ```suggestion
           cql = "SELECT * FROM doms_data where execution_id = %s and is_primary = true limit %s"
   ```
   
   I'm assuming you forgot to change this over.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
