[ 
https://issues.apache.org/jira/browse/BEAM-13983?focusedWorklogId=759354&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759354
 ]

ASF GitHub Bot logged work on BEAM-13983:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Apr/22 16:08
            Start Date: 20/Apr/22 16:08
    Worklog Time Spent: 10m 
      Work Description: TheNeuralBit commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r854308117


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib

Review Comment:
   I think we should catch the ImportError on this since it's an optional 
dependency. We should only fail if the user tries to load a model with 
`ModelFileType.JOBLIB` and joblib failed to import.
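
   A minimal sketch of what that could look like, not the PR's actual code. 
The `load_model` helper and the error messages are hypothetical, and the 
`ModelFileType` enum is re-declared here only so the snippet stands alone. The 
import is attempted once at module load, and the failure is raised only when 
the joblib path is actually requested:

```python
import enum
import pickle

# joblib is treated as optional: record a failed import instead of
# raising at module import time.
try:
  import joblib
except ImportError:
  joblib = None


class ModelFileType(enum.Enum):
  PICKLE = 1
  JOBLIB = 2


def load_model(file, model_file_type=ModelFileType.PICKLE):
  """Hypothetical helper: deserializes a model from an open binary file."""
  if model_file_type == ModelFileType.PICKLE:
    return pickle.load(file)
  if model_file_type == ModelFileType.JOBLIB:
    # Only fail here, when the user explicitly asked for joblib.
    if joblib is None:
      raise ImportError(
          'joblib is required to load models saved with '
          'ModelFileType.JOBLIB, but it could not be imported.')
    return joblib.load(file)
  raise ValueError('Unsupported model_file_type: %r' % model_file_type)
```

   With this shape, users who stay on the default pickle path never need 
joblib installed.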



##########
sdks/python/setup.py:
##########
@@ -159,6 +159,7 @@ def get_version():
 
 REQUIRED_TEST_PACKAGES = [
     'freezegun>=0.3.12',
+    'joblib>=1.1.0',

Review Comment:
   > It is in required packages
   
   I think Andy is referring to this when he says `REQUIRED_PACKAGES`: 
https://github.com/apache/beam/blob/e4d2050ccbaafb90428ab6c0cc494039f6282dae/sdks/python/setup.py#L123-L152
   
   AFAICT joblib isn't there or anywhere else in setup.py. What do you mean by 
that?
   
   Regardless, I think it's appropriate to just add joblib in the test 
packages, since it's an optional dependency (most Beam users can get along 
without it, and even SklearnRunInference users can get along without it, unless 
they change the default `model_file_type` to joblib). That being said, for an 
optional dependency we may want to be more lenient (see my next comment).



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri

Review Comment:
   It sounds like a single filepath is the appropriate interface for both 
sklearn and pytorch, so why not be consistent?



##########
sdks/python/setup.py:
##########
@@ -169,6 +170,7 @@ def get_version():
     'pytest>=4.4.0,<5.0',
     'pytest-xdist>=1.29.0,<2',
     'pytest-timeout>=1.3.3,<2',
+    'scikit-learn>=0.24.2',

Review Comment:
   Is there a reason for the lower bound on the sklearn version? If this 
doesn't work with earlier versions we should make sure to communicate that 
somehow.



##########
sdks/python/apache_beam/ml/inference/sklearn_loader_test.py:
##########
@@ -151,6 +151,12 @@ def test_bad_file_raises(self):
             SklearnModelLoader(model_uri='/var/bad_file_name'))
         pipeline.run()
 
+  def test_bad_input_type_raises(self):
+    with tempfile.NamedTemporaryFile() as file:
+      with self.assertRaises(TypeError):

Review Comment:
   +1





Issue Time Tracking
-------------------

    Worklog Id:     (was: 759354)
    Time Spent: 3h 50m  (was: 3h 40m)

> Implement RunInference for Scikit-learn
> ---------------------------------------
>
>                 Key: BEAM-13983
>                 URL: https://issues.apache.org/jira/browse/BEAM-13983
>             Project: Beam
>          Issue Type: Sub-task
>          Components: sdk-py-core
>            Reporter: Andy Ye
>            Priority: P2
>              Labels: run-inference
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Implement RunInference for Scikit-learn as described in the design doc 
> [https://s.apache.org/inference-sklearn-pytorch]
> There will be a sklearn_impl.py file that contains SklearnModelLoader and 
> SklearnInferenceRunner classes.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
