I've looked at Arrow, and there's some trickiness. Beam has a record-oriented model, while Arrow works best with large batches of records. We could do per-record encoding, but that would likely be inefficient in Arrow.
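To make the batching concern concrete, here is a minimal sketch (plain pyarrow, not actual Beam coder code; the field names and data are made up) contrasting per-record encoding with encoding a batch of records at once:

import pyarrow as pa

# Hypothetical element stream in a record-at-a-time model:
# each element is a single small dict.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]

# Per-record encoding: a one-row RecordBatch per element.
# Each batch carries schema/metadata overhead, so this is likely wasteful.
per_record_batches = [
    pa.RecordBatch.from_arrays(
        [pa.array([r["id"]]), pa.array([r["name"]])],
        names=["id", "name"],
    )
    for r in records
]

# Batched encoding: amortize the columnar layout over many records,
# which is the access pattern Arrow is designed for.
one_batch = pa.RecordBatch.from_arrays(
    [pa.array([r["id"] for r in records]),
     pa.array([r["name"] for r in records])],
    names=["id", "name"],
)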
On Thu, May 31, 2018, 5:50 PM Ismaël Mejía <[email protected]> wrote:

> If I understand correctly, Arrow allows a common multi-language in-memory data representation: basically it is a columnar data format that you can use to transfer data between libraries in Python (pandas, numpy, etc.), Java, and other languages, avoiding the round-trip to disk to do so. So we should maybe take a look at it, because it could be a pretty efficient way to transfer data in multi-language pipelines (useful for portability). They even seem to be working on a full platform based on it with streaming capabilities:
> https://blog.rstudio.com/2018/04/19/arrow-and-beyond/
>
> There is also a serialized version of it called Feather. I suppose an extension to support that format could make sense.
> https://github.com/wesm/feather
>
> Maybe Holden can give some other ideas on possible valid uses in Beam (or correct me if I say something incorrect), because this seems to be important in the Python-on-Spark world at the moment.
>
> On Thu, May 31, 2018 at 2:01 AM Chamikara Jayalath <[email protected]> wrote:
> >
> > On Wed, May 30, 2018 at 4:43 PM Lukasz Cwik <[email protected]> wrote:
> >>
> >> For Python Parquet support, hopefully we can have cross-language pipelines solve this so we only need to implement it once. If it is really popular, having it implemented more than once may be worthwhile.
> >
> > I'd say the Parquet format is popular enough to warrant a Python implementation :). Not sure if there are good candidate client libraries for Python though.
> >
> >> Would the point of Arrow be to treat it as an IO connector similar to ParquetIO or JdbcIO (I was wondering what the purpose of the Arrow integration is)?
> >>
> >> Every C library adds some difficulty for users to test out their pipelines locally unless the C library was cross-compiled for several distributions. Using C libraries increases the need for using a container like Docker for execution.
> >
> > Usually we've preferred libraries that can be directly installed from PyPI over libraries that have more complicated deployment models (native compilation, Conda, etc.). This makes the connector easily available for various runner/user deployments.
> >
> >> On Wed, May 30, 2018 at 1:56 PM Austin Bennett <[email protected]> wrote:
> >>>
> >>> I can see great use cases with S3/Parquet - so that's a great addition (which JB is addressing, for Java)!
> >>>
> >>> It would be even more ideal for the use cases I find myself around if there were Python Parquet support, so perhaps for this next release: would it make sense to be exploring https://arrow.apache.org ? I'd be happy to explore the proper procedure for a design/feature proposal and documentation for Beam, and how to scope and develop it.
> >>>
> >>> Also, from the little I've looked at the actual implementation, it appears that (py)arrow relies on underlying C binaries, which was listed as a problem, or at least a point against the choice of package, with the developing Python/Kafka source. How big an issue is that -- what else should I be considering? Guidance absolutely welcomed!
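For reference, a minimal sketch of the pyarrow usage the thread alludes to (the file paths and data below are hypothetical): a pandas DataFrame moving through Arrow's in-memory representation, plus Feather and Parquet as the serialized forms.

import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# In-memory: pandas <-> Arrow without a round-trip to disk.
table = pa.Table.from_pandas(df)
df_again = table.to_pandas()

# Feather: a serialized form of the Arrow representation.
feather.write_feather(df, "/tmp/example.feather")

# Parquet via pyarrow -- one candidate client library for Python Parquet support.
pq.write_table(table, "/tmp/example.parquet")
table_back = pq.read_table("/tmp/example.parquet")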
