Currently, the Python SDK doesn't have a transform for reading XML files. Your best bet is probably to use the Python SDK's file system abstraction [1] to read XML files from a custom ParDo. Adding a Reshuffle transform [2] after this will also allow Dataflow to better rebalance the steps that come after reading.
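As a rough illustration (not part of the SDK), a custom DoFn along these lines could use the FileSystems abstraction to open and parse matched XML files, followed by a Reshuffle; the file pattern, the 'record' element name, and the ReadXmlRecords class name are placeholders I made up for the sketch:

    import xml.etree.ElementTree as ET

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems


    class ReadXmlRecords(beam.DoFn):
        """Opens each matched XML file via FileSystems and yields one dict per element."""

        def process(self, file_pattern):
            for match_result in FileSystems.match([file_pattern]):
                for metadata in match_result.metadata_list:
                    with FileSystems.open(metadata.path) as f:
                        tree = ET.parse(f)
                        # 'record' is a hypothetical element name for this sketch.
                        for record in tree.getroot().iter('record'):
                            yield {child.tag: child.text for child in record}


    with beam.Pipeline() as p:
        records = (
            p
            | 'FilePatterns' >> beam.Create(['gs://my-bucket/data/*.xml'])
            | 'ParseXml' >> beam.ParDo(ReadXmlRecords())
            | 'Reshuffle' >> beam.Reshuffle())

Note that this parses each file in a single worker call, so it works best when individual files fit comfortably in memory.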
Thanks,
Cham

[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L614

On Fri, Nov 9, 2018 at 3:53 PM Sean Schwartz <[email protected]> wrote:
> Hello,
>
> My company, SwiftIQ, uses Google Dataflow for our large-scale data
> processing pipeline. We currently use Java as our codebase. We are
> looking at Python, but I'm having trouble seeing whether our dataflow
> can be supported using Python.
>
> The first step of the pipeline should be an I/O read transform of an
> XML file. I see that this package exists in Java, but I'm not finding
> it as a module in Python.
>
> Is there a Python module that does this? If not, is there a way to
> write our own custom read transform that reads an XML file into a
> PCollection?
>
> A quick response would be greatly appreciated.
>
> Thanks!
>
> Sean Schwartz
>
> --
>
> Sean Schwartz
> Data Engineer
> Cell: 847.772.0240
