On Wed, May 30, 2018 at 4:43 PM Lukasz Cwik <[email protected]> wrote:

> For Python Parquet support, hopefully we can have cross-language pipelines
> solve this so we only need to implement it once. If it is really popular,
> having it implemented more than once may be worthwhile.
>

I'd say the Parquet format is popular enough to warrant a Python
implementation :). I'm not sure whether there are good candidate client
libraries for Python, though.
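
To make the idea concrete, here's a rough sketch of what a pyarrow-backed
read could look like in a Beam Python pipeline. This is just an
illustration, not a proposed API: the _ReadParquetFn helper and the input
path are made up, while pyarrow.parquet.read_table and Table.to_pydict are
the actual pyarrow calls.

    import apache_beam as beam
    import pyarrow.parquet as pq

    class _ReadParquetFn(beam.DoFn):
        """Hypothetical DoFn: read one Parquet file, emit rows as dicts."""

        def process(self, file_path):
            table = pq.read_table(file_path)  # real pyarrow API
            columns = table.to_pydict()       # column name -> list of values
            names = list(columns)
            for values in zip(*(columns[name] for name in names)):
                yield dict(zip(names, values))

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(['/tmp/input.parquet'])  # placeholder path
            | beam.ParDo(_ReadParquetFn())
            | beam.Map(print)
        )

A real connector would of course need splittable reads, schema handling,
and filesystem integration on top of this, but the pyarrow surface it
would sit on is roughly the above.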


> Would the point of Arrow be to treat it as an IO connector similar to
> ParquetIO or JdbcIO (I was wondering what the purpose of the Arrow
> integration is)?
>
> Every C library adds some difficulty for users to test out their pipelines
> locally unless the C library has been cross-compiled for several
> distributions. Using C libraries increases the need for a container like
> Docker for execution.
>

Usually we've preferred libraries that can be installed directly from PyPI
over libraries with more complicated deployment models (native compilation,
Conda, etc.). This makes the connector easily available across various
runner/user deployments.
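
For what it's worth, pyarrow does publish prebuilt wheels to PyPI, so on
platforms with a matching wheel the usual install path should just work.
A quick sanity check (assuming a wheel exists for your platform, so no
local C++ toolchain is needed):

    # pip install pyarrow   (pulls a prebuilt wheel where one is available)
    import pyarrow
    import pyarrow.parquet  # the Parquet bindings ship in the same wheel
    print(pyarrow.__version__)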


>
>
> On Wed, May 30, 2018 at 1:56 PM Austin Bennett <
> [email protected]> wrote:
>
>> I can see great use cases with S3/Parquet - so that's a great addition
>> (which JB is addressing, for Java)!
>>
>> For the use cases I find myself around, it would be even more ideal to
>> have Python Parquet support, so perhaps for this next release: would it
>> make sense to explore https://arrow.apache.org ?  I'd be happy to work
>> out the proper procedure for a design/feature proposal and documentation
>> for Beam, and how to scope and develop it.
>>
>> Also, from the little I've looked at the actual implementation, it
>> appears that (py)arrow relies on underlying C binaries, which was listed
>> as a problem, or at least a point against a candidate package, when the
>> Python/Kafka source was being developed.  How big an issue is that -- what
>> else should I be considering?  Guidance absolutely welcome!
>>
>
