Hi Jeremy,

It looks like you are using Beam's DirectRunner
<https://beam.apache.org/documentation/runners/direct/>, which is intended
mainly for small pipelines and testing purposes.
Even though it runs entirely in the local JVM, it deliberately enforces
checks that make sure your pipeline will be able to be distributed
correctly once you choose a production-ready runner (e.g., Dataflow,
Spark, Flink). From the link above, it is:

- enforcing immutability of elements
- enforcing encodability of elements

There are ways to disable those checks (--enforceEncodability=false,
--enforceImmutability=false), but to get the most out of Beam and keep
the pipeline portable to one of those runners in the future, I believe
the best approach for the sink part would be to write to a file and read
it back in the GUI application.
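A minimal sketch of that sink side, using the standard TextIO file sink (the output path here is just a placeholder, and I'm assuming one JSON document per line):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;

public class FileSinkSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    // Write each element as a line of text; the GUI application can then
    // read the resulting file(s) back and display them in the JTextArea.
    // "/tmp/solver-results" is a hypothetical path prefix.
    p.apply(Create.of("{\"result\": 42}"))
     .apply(TextIO.write().to("/tmp/solver-results").withSuffix(".json"));
    p.run().waitUntilFinish();
  }
}
```

TextIO shards its output by default, so the file name will carry a shard suffix (you can use withoutSharding() if the GUI expects a single file).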

For the source part, you may want to use Create
<https://beam.apache.org/documentation/transforms/java/other/create/> to
build a PCollection from specific in-memory elements.
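For example (the element values are placeholders standing in for your in-memory JSON):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class InMemorySourceSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    // Turn in-memory strings (e.g. JSON held in a String or byte buffer
    // for the debugging scenario) directly into a PCollection. Strings
    // have a default Coder, so the encodability check is satisfied.
    PCollection<String> input =
        p.apply(Create.of("{\"x\": 1}", "{\"x\": 2}"));
    p.run().waitUntilFinish();
  }
}
```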

If you are getting serialization exceptions in scenarios that should be
supported, there are a few likely causes -- for example, when you use a
lambda, Java will sometimes try to serialize the entire enclosing
instance in order to capture the members the lambda references. Defining
your own DoFn classes and passing in only the Serializable values you
actually need may resolve it.
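Something along these lines (the class name and prefix parameter are hypothetical):

```java
import org.apache.beam.sdk.transforms.DoFn;

// A DoFn defined as its own top-level class, so that only the fields it
// declares are serialized -- not the enclosing object that a lambda or
// anonymous inner class would otherwise capture.
public class AddPrefixFn extends DoFn<String, String> {
  private final String prefix;  // String is Serializable, so this is safe

  public AddPrefixFn(String prefix) {
    this.prefix = prefix;
  }

  @ProcessElement
  public void processElement(@Element String element,
                             OutputReceiver<String> out) {
    out.output(prefix + element);
  }
}
```

DoFn itself implements Serializable, so as long as every field you add is also serializable (or transient and re-created in @Setup), the runner can ship the function to workers without a NotSerializableException.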


Best,
Bruno




On Sat, Aug 12, 2023 at 11:34 AM Jeremy Bloom <jeremybl...@gmail.com> wrote:

> Hello-
> I am fairly new to Beam but have been working with Apache Spark for a
> number of years. The application I am developing uses a data pipeline to
> ingest JSON with a particular schema, uses it to prepare data for a service
> that I do not control (a mathematical optimization solver), runs the
> application and recovers its results, and then publishes the results in
> JSON (same schema).  Although I work in Java, colleagues of mine are
> implementing in Python. This is an open-source, non-commercial project.
>
> The application has three kinds of IO sources/sinks: file system files
> (using Windows now, but Unix in the future), URL, and in-memory (string,
> byte buffer, etc). The last is primarily used for debugging, displayed in a
> JTextArea.
>
> I have not found a Beam IO connector that handles all three data
> sources/sinks, particularly the in-memory sink. I have tried adapting
> FileIO and TextIO, however, I continually run up against objects that are
> not serializable, particularly Java OutputStream and its subclasses. I have
> looked at the code for FileIO and TextIO as well as several other custom IO
> implementations, but none of them addresses this particular bug.
>
> The CSVSink example in the FileIO Javadoc uses a PrintWriter, which is not
> serializable; when I tried the same thing, I got a not-serializable
> exception. How does this example actually avoid this error? In the code for
> TextIO.Sink, the PrintWriter field is marked transient, meaning that it is
> not serialized, but again, when I tried the same thing, I got an exception.
>
> Please explain, in particular, how to write a Sink that avoids the not
> serializable exception. In general, please explain how I can use a Beam IO
> connector for the three kinds of data sources/sinks I want to use (file
> system, url, and in-memory).
>
> After the frustrations I had with Spark, I have high hopes for Beam. This
> issue is a blocker for me.
>
> Thank you.
> Jeremy Bloom
>
