If I understand correctly, Arrow provides a common multi-language in-memory data representation: essentially a columnar data format that you can use to transfer data between libraries in Python (pandas, numpy, etc.), Java, and other languages, avoiding the round-trip to disk. We should probably take a look at it, because it could be a pretty efficient way to transfer data in multi-language pipelines (useful for portability). They even seem to be working on a full platform based on it, with streaming capabilities: https://blog.rstudio.com/2018/04/19/arrow-and-beyond/
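
For anyone who hasn't tried it, here is a minimal sketch of that in-memory hand-off from Python (assuming the pyarrow package installed from PyPI; the column names here are just illustrative):

```python
# Minimal sketch: move a pandas DataFrame into Arrow's columnar in-memory
# format and back, with no serialization to disk in between.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Convert to an Arrow Table: a language-agnostic columnar representation
# that any Arrow-aware library (in any language) can consume directly.
table = pa.Table.from_pandas(df)

# On the consuming side, convert back to pandas.
df_roundtrip = table.to_pandas()
```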
There is also a serialized (on-disk) version of it called Feather: https://github.com/wesm/feather. I suppose an extension to support this format could make sense (a quick sketch of the Feather round-trip follows below the quoted thread). Maybe Holden can suggest other valid uses for it in Beam (or correct me if I've said something incorrect), since Arrow seems to be important in the Python-on-Spark world at the moment.

On Thu, May 31, 2018 at 2:01 AM Chamikara Jayalath <[email protected]> wrote:
>
> On Wed, May 30, 2018 at 4:43 PM Lukasz Cwik <[email protected]> wrote:
>>
>> For Python Parquet support, hopefully we can have cross-language pipelines
>> solve this so we only need to implement it once. If it is really popular,
>> having it implemented more than once may be worthwhile.
>
> I'd say the Parquet format is popular enough to warrant a Python implementation
> :). Not sure if there are good candidate client libraries for Python, though.
>
>> Would the point of Arrow be to treat it as an IO connector similar to
>> ParquetIO or JdbcIO (I was wondering what the purpose of the Arrow
>> integration is)?
>>
>> Every C library adds some difficulty for users to test out their pipelines
>> locally unless the C library was cross-compiled for several distributions.
>> Using C libraries increases the need for using a container like Docker for
>> execution.
>
> Usually we've preferred libraries that can be installed directly from PyPI
> over libraries that have more complicated deployment models (native
> compilation, Conda, etc.). This makes the connector easily available for
> various runner/user deployments.
>
>> On Wed, May 30, 2018 at 1:56 PM Austin Bennett <[email protected]> wrote:
>>>
>>> I can see great use cases with S3/Parquet, so that's a great addition
>>> (which JB is addressing, for Java)!
>>>
>>> It would be even more ideal for the use cases I find myself around if
>>> there were Python Parquet support, so perhaps for this next release:
>>> would it make sense to be exploring https://arrow.apache.org ? I'd be
>>> happy to explore the proper procedure for design/feature proposals and
>>> documentation for Beam, and how to scope and develop it.
>>>
>>> Also, from the little I've looked at the actual implementation, it appears
>>> that (py)arrow relies on underlying C binaries, which was listed as a
>>> problem, or at least a point against the choice of package, for the
>>> Python/Kafka source under development. How big an issue is that -- what
>>> else should I be considering? Guidance absolutely welcomed!
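
P.S. For reference, the Feather round-trip mentioned above, as a minimal sketch (again assuming pyarrow from PyPI; the standalone feather-format package exposes a similar API, and the file name here is just an example):

```python
# Minimal sketch: serialize Arrow columnar data to a Feather file and
# read it back. Any Arrow-aware language could read the same file.
import pandas as pd
from pyarrow import feather

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Write the frame to a Feather file (Arrow's on-disk serialization).
feather.write_feather(df, "example.feather")

# Read it back into a pandas DataFrame.
df_back = feather.read_feather("example.feather")
```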
