I've looked at Arrow, and there's some trickiness. Beam has a record-oriented model, while Arrow works best with large batches of records. We could do per-record encoding, but that would likely be inefficient in Arrow.
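To make the batching concern concrete, here is a minimal sketch (plain pyarrow, not actual Beam coder code; the field names and data are made up) contrasting per-record encoding with encoding a batch of records at once:

import pyarrow as pa

# Hypothetical element stream in a record-at-a-time model:
# each element is a single small dict.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]

# Per-record encoding: a one-row RecordBatch per element.
# Each batch carries schema/metadata overhead, so this is likely wasteful.
per_record_batches = [
    pa.RecordBatch.from_arrays(
        [pa.array([r["id"]]), pa.array([r["name"]])],
        names=["id", "name"],
    )
    for r in records
]

# Batched encoding: amortize the columnar layout over many records,
# which is the access pattern Arrow is designed for.
one_batch = pa.RecordBatch.from_arrays(
    [pa.array([r["id"] for r in records]),
     pa.array([r["name"] for r in records])],
    names=["id", "name"],
)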
On Thu, May 31, 2018, 5:50 PM Ismaël Mejía <[email protected]> wrote:

> If I understand correctly, Arrow allows a common multi-language in-memory data representation: basically it is a columnar data format that you can use to transfer data between libraries in Python (pandas, numpy, etc.), Java, and other languages, avoiding the round-trip to disk to do so. So we should maybe take a look at it, because it could be a pretty efficient way to transfer data in multi-language pipelines (useful for portability). They even seem to be working on a full platform based on it with streaming capabilities:
> https://blog.rstudio.com/2018/04/19/arrow-and-beyond/
>
> There is also a serialized version of it called Feather. I suppose an extension to support that format could make sense.
> https://github.com/wesm/feather
>
> Maybe Holden can give some other ideas on possible valid uses in Beam (or correct me if I say something incorrect), because this seems to be important in the Python-on-Spark world at the moment.
>
> On Thu, May 31, 2018 at 2:01 AM Chamikara Jayalath <[email protected]> wrote:
> >
> > On Wed, May 30, 2018 at 4:43 PM Lukasz Cwik <[email protected]> wrote:
> >>
> >> For Python Parquet support, hopefully we can have cross-language pipelines solve this so we only need to implement it once. If it is really popular, having it implemented more than once may be worthwhile.
> >
> > I'd say the Parquet format is popular enough to warrant a Python implementation :). Not sure if there are good candidate client libraries for Python though.
> >
> >> Would the point of Arrow be to treat it as an IO connector similar to ParquetIO or JdbcIO (I was wondering what the purpose of the Arrow integration is)?
> >>
> >> Every C library adds some difficulty for users to test out their pipelines locally unless the C library was cross-compiled for several distributions. Using C libraries increases the need for using a container like Docker for execution.
> >
> > Usually we've preferred libraries that can be directly installed from PyPI over libraries that have more complicated deployment models (native compilation, Conda, etc.). This makes the connector easily available for various runner/user deployments.
> >
> >> On Wed, May 30, 2018 at 1:56 PM Austin Bennett <[email protected]> wrote:
> >>>
> >>> I can see great use cases with S3/Parquet - so that's a great addition (which JB is addressing, for Java)!
> >>>
> >>> It would be even more ideal for the use cases I find myself around if there were Python Parquet support, so perhaps for this next release: would it make sense to be exploring https://arrow.apache.org ? I'd be happy to explore the proper procedure for a design/feature proposal and documentation for Beam, and how to scope and develop it.
> >>>
> >>> Also, from the little I've looked at the actual implementation, it appears that (py)arrow relies on underlying C binaries, which was listed as a problem, or at least a point against the choice of package, with the developing Python/Kafka source. How big an issue is that -- what else should I be considering? Guidance absolutely welcomed!
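For reference, a minimal sketch of the pyarrow usage the thread alludes to (the file paths and data below are hypothetical): a pandas DataFrame moving through Arrow's in-memory representation, plus Feather and Parquet as the serialized forms.

import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# In-memory: pandas <-> Arrow without a round-trip to disk.
table = pa.Table.from_pandas(df)
df_again = table.to_pandas()

# Feather: a serialized form of the Arrow representation.
feather.write_feather(df, "/tmp/example.feather")

# Parquet via pyarrow -- one candidate client library for Python Parquet support.
pq.write_table(table, "/tmp/example.parquet")
table_back = pq.read_table("/tmp/example.parquet")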
