[ 
https://issues.apache.org/jira/browse/BEAM-7443?focusedWorklogId=251089&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-251089
 ]

ASF GitHub Bot logged work on BEAM-7443:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 30/May/19 16:35
            Start Date: 30/May/19 16:35
    Worklog Time Spent: 10m 
      Work Description: lukecwik commented on pull request #8641: [BEAM-7443] Create a BoundedSource -> SDF wrapper in Python SDK
URL: https://github.com/apache/beam/pull/8641#discussion_r289054823
 
 

 ##########
 File path: sdks/python/apache_beam/io/iobase.py
 ##########
 @@ -910,6 +913,67 @@ def from_runner_api_parameter(parameter, context):
     Read.from_runner_api_parameter)
 
 
+class SDFBoundedSourceWrapper(ptransform.PTransform):
+  """A ``PTransform`` that uses SDF to read from a ``BoundedSource``.
+  NOTE: This transform can only be used with beam_fn_api enabled.
+  """
+
+  def __init__(self, source):
+    if not isinstance(source, BoundedSource):
+      raise RuntimeError('SDFBoundedSourceWrapper can only wrap BoundedSource')
+    super(SDFBoundedSourceWrapper, self).__init__()
+    self.source = source
+
+  def _get_desired_chunk_size(self):
+    total_size = self.source.estimate_size()
+    if total_size:
+      # 1MB = 1 shard, 1GB = 32 shards, 1TB = 1000 shards, 1PB = 32k shards
+      chunk_size = max(1 << 20, 1000 * int(math.sqrt(total_size)))
+    else:
+      chunk_size = 64 << 20  # 64mb
+    return chunk_size
+
+  def _create_sdf_bounded_source_dofn(self):
+    from apache_beam.io.sdf_restriction_provider \
+      import SDFBoundedSourceRestrictionProvider
+    chunk_size = self._get_desired_chunk_size()
+    source = self.source
+
+    class SDFBoundedSourceDoFn(core.DoFn):
+      def __init__(self, read_source):
+        self.source = read_source
+
+      def process(
+          self,
+          element,
+          restriction_tracker=core.DoFn.RestrictionParam(
+              SDFBoundedSourceRestrictionProvider(source, chunk_size))):
+        start_pos, end_pos = restriction_tracker.current_restriction()
+        range_tracker = self.source.get_range_tracker(start_pos, end_pos)
+        return self.source.read(range_tracker)
+
+    return SDFBoundedSourceDoFn(self.source)
+
+  def expand(self, pbegin):
+    return (pbegin
+            | core.Impulse()
+            | core.ParDo(self._create_sdf_bounded_source_dofn()))
+
+  def get_windowing(self, unused_inputs):
+    return core.Windowing(window.GlobalWindows())
+
+  def _infer_output_coder(self, input_type=None, input_coder=None):
+    if isinstance(self.source, BoundedSource):
 
 Review comment:
   The constructor already checks that source is a BoundedSource, so this 
condition is always true and the isinstance check here is redundant.
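
   As a rough illustration of the point above, here is a minimal, self-contained
sketch that does not import Beam. `BoundedSource` and `Wrapper` are hypothetical
stand-ins for the classes in the diff. The body of `_infer_output_coder` is an
assumption, since the quoted diff is cut off right after the `isinstance` line;
the sketch only assumes the guarded branch returns the source's default coder.

```python
class BoundedSource(object):
    """Hypothetical stand-in for the BoundedSource class in iobase.py."""

    def default_output_coder(self):
        # Placeholder value; a real source would return a coder object.
        return 'stand-in-coder'


class Wrapper(object):
    """Hypothetical stand-in for SDFBoundedSourceWrapper."""

    def __init__(self, source):
        # Same guard as in the diff: reject anything that is not a BoundedSource.
        if not isinstance(source, BoundedSource):
            raise RuntimeError('Wrapper can only wrap BoundedSource')
        self.source = source

    def _infer_output_coder(self, input_type=None, input_coder=None):
        # The constructor guarantees self.source is a BoundedSource, so no
        # second isinstance check is needed before using it.
        # Assumption: the elided branch returns the source's default coder.
        return self.source.default_output_coder()
```

   With this shape, `Wrapper(object())` fails at construction time, so any
instance that reaches `_infer_output_coder` already holds a BoundedSource.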
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 251089)
    Time Spent: 1h 40m  (was: 1.5h)

>  BoundedSource->SDF needs a wrapper in Python SDK
> -------------------------------------------------
>
>                 Key: BEAM-7443
>                 URL: https://issues.apache.org/jira/browse/BEAM-7443
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-py-core
>            Reporter: Boyuan Zhang
>            Assignee: Boyuan Zhang
>            Priority: Major
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
