lostluck commented on issue #30347:
URL: https://github.com/apache/beam/issues/30347#issuecomment-2248947759

   Thank you for the report!
   
   It shouldn't matter to prism that the file is in GCS.  I could see perhaps 
an issue when running docker workers, but prism usually runs Go SDK pipelines 
in loopback mode.
   
   Most likely the additional latency to GCS is causing splits to go haywire.
   
   We did have https://github.com/apache/beam/pull/29968 which reduced the 
split aggression, but that  made it into 2.54.0, so something else is at play 
(or the initial latency hit is very significant).
   
   
https://github.com/apache/beam/commits/release-2.54.0/sdks/go/pkg/beam/runners/prism/internal/stage.go
   
   
   This pipeline is roughly equivalent to the Wordcount example ( 
beam/sdks/go/examples/wordcount/wordcount.go) which does work.
   
   But the repro pipeline instead has an anonymous function as a DoFn. Those 
don't work on portable runners, and are not meaningfully supportable.  If the 
original tests were on the Go Direct Runner (which would be the case in 
2.54.0), then they'd have worked, but only because the Direct runner doesn't 
serialize anything. 
   
   I suspect that's the root cause of the issue in this case, and this is 
validated once I move the inlined function to a registered, named function call 
like the following snippet does:
   
   ```
   func updateLine(line string, emit func(string)) {
        output := fmt.Sprintf("line had length %d", len(line))
                fmt.Print(output + "\n")
                emit(output)
        }, all)
   }
   
   func init() {
     register.Function2x0(updateLine)
   }
   
   func main() {
      ...
      spans := beam.ParDo(s, updateLine, all)
      ...
   ```
   
   Unfortunately for us, it's not presently possible in Go to detect if a 
function is an anonymous function or not, outside of doing static analysis on 
the source code. Were I to redesign the SDK, I wouldn't permit simple functions 
like that as DoFn, as they cause more issues than they solve, or I'd provide a 
more closure/anonymous safe way of building pipelines to enable them, and avoid 
registrations entirely.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to