If I understand correctly, Arrow provides a common multi-language in-memory data representation: essentially a columnar data format that you can use to transfer data between libraries in Python (pandas, numpy, etc.), Java, and other languages, avoiding the round-trip to disk. We should probably take a look at it, because it could be a pretty efficient way to transfer data in multi-language pipelines (useful for portability). They even seem to be working on a full platform based on it, with streaming capabilities: https://blog.rstudio.com/2018/04/19/arrow-and-beyond/
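
For anyone who hasn't tried it, here is a minimal sketch of that in-memory hand-off from Python (assuming the pyarrow package installed from PyPI; the column names here are just illustrative):

```python
# Minimal sketch: move a pandas DataFrame into Arrow's columnar in-memory
# format and back, with no serialization to disk in between.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Convert to an Arrow Table: a language-agnostic columnar representation
# that any Arrow-aware library (in any language) can consume directly.
table = pa.Table.from_pandas(df)

# On the consuming side, convert back to pandas.
df_roundtrip = table.to_pandas()
```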
There is also a serialized (on-disk) version of it called Feather: https://github.com/wesm/feather. I suppose an extension to support this format could make sense (a quick sketch of the Feather round-trip follows below the quoted thread). Maybe Holden can suggest other valid uses for it in Beam (or correct me if I've said something incorrect), since Arrow seems to be important in the Python-on-Spark world at the moment.

On Thu, May 31, 2018 at 2:01 AM Chamikara Jayalath <[email protected]> wrote:
>
> On Wed, May 30, 2018 at 4:43 PM Lukasz Cwik <[email protected]> wrote:
>>
>> For Python Parquet support, hopefully we can have cross-language pipelines
>> solve this so we only need to implement it once. If it is really popular,
>> having it implemented more than once may be worthwhile.
>
> I'd say the Parquet format is popular enough to warrant a Python implementation
> :). Not sure if there are good candidate client libraries for Python, though.
>
>> Would the point of Arrow be to treat it as an IO connector similar to
>> ParquetIO or JdbcIO (I was wondering what the purpose of the Arrow
>> integration is)?
>>
>> Every C library adds some difficulty for users to test out their pipelines
>> locally unless the C library was cross-compiled for several distributions.
>> Using C libraries increases the need for using a container like Docker for
>> execution.
>
> Usually we've preferred libraries that can be installed directly from PyPI
> over libraries that have more complicated deployment models (native
> compilation, Conda, etc.). This makes the connector easily available for
> various runner/user deployments.
>
>> On Wed, May 30, 2018 at 1:56 PM Austin Bennett <[email protected]> wrote:
>>>
>>> I can see great use cases with S3/Parquet, so that's a great addition
>>> (which JB is addressing, for Java)!
>>>
>>> It would be even more ideal for the use cases I find myself around if
>>> there were Python Parquet support, so perhaps for this next release:
>>> would it make sense to be exploring https://arrow.apache.org ? I'd be
>>> happy to explore the proper procedure for design/feature proposals and
>>> documentation for Beam, and how to scope and develop it.
>>>
>>> Also, from the little I've looked at the actual implementation, it appears
>>> that (py)arrow relies on underlying C binaries, which was listed as a
>>> problem, or at least a point against the choice of package, for the
>>> Python/Kafka source under development. How big an issue is that -- what
>>> else should I be considering? Guidance absolutely welcomed!
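
P.S. For reference, the Feather round-trip mentioned above, as a minimal sketch (again assuming pyarrow from PyPI; the standalone feather-format package exposes a similar API, and the file name here is just an example):

```python
# Minimal sketch: serialize Arrow columnar data to a Feather file and
# read it back. Any Arrow-aware language could read the same file.
import pandas as pd
from pyarrow import feather

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Write the frame to a Feather file (Arrow's on-disk serialization).
feather.write_feather(df, "example.feather")

# Read it back into a pandas DataFrame.
df_back = feather.read_feather("example.feather")
```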
