Currently, the Python SDK doesn't have a transform for reading XML files.
Probably your best bet will be to use the Python SDK's file system
abstraction [1] to read XML files from a custom ParDo. Adding a Reshuffle
transform [2] after the read will also allow Dataflow to better rebalance
the steps that come after reading.
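
For example, a rough (untested) sketch might look something like the
following; the bucket path, the 'record' element name, and the
attribute-based output are just placeholders for your actual data:

import xml.etree.ElementTree as ET

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class ReadXmlRecords(beam.DoFn):
  # Opens each file matching the given pattern through the Beam
  # FileSystems abstraction and yields one dict per <record> element.
  def process(self, file_pattern):
    match_result = FileSystems.match([file_pattern])[0]
    for metadata in match_result.metadata_list:
      with FileSystems.open(metadata.path) as f:
        tree = ET.parse(f)
        for element in tree.getroot().iter('record'):
          yield element.attrib


with beam.Pipeline() as p:
  records = (
      p
      | 'FilePattern' >> beam.Create(['gs://my-bucket/data/*.xml'])
      | 'ReadXml' >> beam.ParDo(ReadXmlRecords())
      | 'Reshuffle' >> beam.Reshuffle())

Note that this parses each matched file in a single DoFn call, so very
large files won't be split; the Reshuffle at least keeps the downstream
steps from being fused to the read.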

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L614


On Fri, Nov 9, 2018 at 3:53 PM Sean Schwartz <[email protected]> wrote:

> Hello,
>
> My company, SwiftIQ, uses Google Dataflow for our large scale data
> processing pipeline. We are currently using Java as our codebase. We are
> looking at Python, but I'm having trouble trying to see if our dataflow
> can be supported using Python.
>
> Our first step of the pipeline should be an I/O Read Transform of an XML
> file. I see that this package exists in Java; however, I'm not finding it
> as a module in Python.
>
> Is there a Python module that does this? If not, is there a way to write
> our own custom Read Transform that reads an XML file into a PCollection?
>
> A quick response would be greatly appreciated.
>
> Thanks!
>
> Sean Schwartz
>
> --
>
> Sean Schwartz
> Data Engineer
> Cell: 847.772.0240
>