For the latter, can we have the Fn API data plane transmit sub-bundle groupings so that we still benefit from the memory layout? On input the runner controls the grouping, on output the SDK controls it (spilling)? Just random thoughts.
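To make that slightly more concrete, here is a very rough pyarrow sketch (nothing Beam-specific; the sub-bundle size and function name are made up) of what grouping a bundle's (key, value) elements into Arrow record batches could look like:

# Purely illustrative, not a Beam API: group a bundle's elements into
# fixed-size "sub-bundles", each encoded as an Arrow RecordBatch.
import pyarrow as pa

SUB_BUNDLE_SIZE = 1024  # hypothetical; runner picks it on input, SDK on output

def to_sub_bundles(elements):
    """Yield one Arrow RecordBatch per sub-bundle of (key, value) pairs."""
    keys, values = [], []
    for key, value in elements:
        keys.append(key)
        values.append(value)
        if len(keys) == SUB_BUNDLE_SIZE:
            yield pa.RecordBatch.from_arrays(
                [pa.array(keys), pa.array(values)], names=["key", "value"])
            keys, values = [], []
    if keys:  # emit ("spill") whatever is left at the end of the bundle
        yield pa.RecordBatch.from_arrays(
            [pa.array(keys), pa.array(values)], names=["key", "value"])

On input the runner could choose the sub-bundle size; on output the SDK would just emit a partial batch whenever it needs to spill.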
Kenn

On Thu, May 31, 2018 at 8:21 AM Lukasz Cwik <[email protected]> wrote:

> Tyler and I had reached out to the Arrow folks[1] asking how we could support KV<Key, Iterable<Values>> when the iterable of values is beyond memory size limits. There is an open JIRA about adding support for large byte[], string, and list types in ARROW-750[2]. Robert had pointed out that when handling really large values we could do the same thing over the Beam Fn State API that we are planning to do when using the Beam Fn Data API, as described here[3].
>
> The other issue that hasn't been discussed yet is that Arrow materializes and stores the data in memory (or on disk), while the Beam Fn Data API is more about "streaming" data between two actors. This lets us process very large bundles and also allows for arbitrary blow-up in output from a single element (a runner can effectively control how large a bundle sent to an SDK harness is, but can't guarantee that the SDK will not take a single element and produce lots and lots of data from it).
>
> 1: https://lists.apache.org/thread.html/ce36c311e34af8bea230c89e7ada38923e6845d6bc875ccfbc003cfe@%3Cdev.arrow.apache.org%3E
> 2: https://issues.apache.org/jira/browse/ARROW-750
> 3: https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.y6e78jyiwn50
>
> On Thu, May 31, 2018 at 7:56 AM Reuven Lax <[email protected]> wrote:
>
>> I've looked at Arrow, and there's some trickiness. Beam has a record model and Arrow works best with large batches of records. We could do per-record encoding, but that might be inefficient in Arrow.
>>
>> On Thu, May 31, 2018, 5:50 PM Ismaël Mejía <[email protected]> wrote:
>>
>>> If I understand correctly, Arrow provides a common multi-language in-memory data representation: basically a columnar data format that you can use to transfer data between libraries in Python (pandas, numpy, etc.), Java, and other languages. This avoids the round-trip to disk to do so. So we should maybe take a look at it, because it could be a pretty efficient way to transfer data in multi-language pipelines (useful for portability). They even seem to be working on a full platform based on it with streaming capabilities: https://blog.rstudio.com/2018/04/19/arrow-and-beyond/
>>>
>>> There is also a serialized version of it called Feather. I suppose an extension to support this format could make sense. https://github.com/wesm/feather
>>>
>>> Maybe Holden can give some other ideas on possible valid uses in Beam (or correct me if I say something incorrect), because this seems to be important in the Python-on-Spark world at the moment.
>>>
>>> On Thu, May 31, 2018 at 2:01 AM Chamikara Jayalath <[email protected]> wrote:
>>> >
>>> > On Wed, May 30, 2018 at 4:43 PM Lukasz Cwik <[email protected]> wrote:
>>> >>
>>> >> For Python Parquet support, hopefully we can have cross-language pipelines solve this so we only need to implement it once. If it is really popular, having it implemented more than once may be worthwhile.
>>> >
>>> > I'd say the Parquet format is popular enough to warrant a Python implementation :). Not sure if there are good candidate client libraries for Python though.
>>> >
>>> >> Would the point of Arrow be to treat it as an IO connector similar to ParquetIO or JdbcIO (I was wondering what the purpose of the Arrow integration is)?
>>> >>
>>> >> Every C library adds some difficulty for users to test out their pipelines locally unless the C library was cross-compiled for several distributions. Using C libraries increases the need for using a container like Docker for execution.
>>> >
>>> > Usually we've preferred libraries that can be installed directly from PyPI over libraries that have more complicated deployment models (native compilation, Conda, etc.). This will make the connector easily available for various runner/user deployments.
>>> >
>>> >> On Wed, May 30, 2018 at 1:56 PM Austin Bennett <[email protected]> wrote:
>>> >>>
>>> >>> I can see great use cases with S3/Parquet - so that's a great addition (which JB is addressing, for Java)!
>>> >>>
>>> >>> It would be even more ideal for the use cases I find myself around if there were Python Parquet support, so perhaps for this next release: would it make sense to be exploring https://arrow.apache.org? I'd be happy to follow the proper procedure for a design/feature proposal and documentation for Beam, and to work out how to scope and develop it.
>>> >>>
>>> >>> Also, from the little I've looked at the actual implementation, it appears that (py)arrow relies on underlying C binaries, which was listed as a problem, or at least a point against that choice of package, for the Python/Kafka source being developed. How big an issue is that -- what else should I be considering? Guidance absolutely welcomed!
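Re Luke's point about iterables beyond memory limits: the idea in [3] is roughly that the SDK fetches the values lazily, page by page, instead of materializing the whole iterable. A toy sketch of that shape (the client interface and token fields here are invented for illustration, not the real Beam Fn State API protos):

# Toy sketch of lazily paging a huge iterable; state_client and the
# continuation-token fields are invented, not actual Beam APIs.
def lazy_iterable(state_client, state_key, page_size=1000):
    """Yield values one page at a time instead of materializing them all."""
    token = None
    while True:
        page = state_client.get(state_key, page_size=page_size, token=token)
        for value in page.values:
            yield value
        token = page.continuation_token
        if not token:  # an empty token means the last page has been read
            return

The runner only needs to answer "give me the next page after this token" requests; the SDK never has to hold the whole iterable in memory.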

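And for Ismaël's Feather pointer, the round-trip is about as small as it gets (pyarrow; the table contents and path below are made up):

# Minimal Feather round-trip with pyarrow; purely illustrative.
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"word": ["beam", "arrow"], "count": [3, 5]})
feather.write_feather(table, "/tmp/counts.feather")    # columnar file on disk
roundtrip = feather.read_table("/tmp/counts.feather")  # back as an Arrow table

Python and R (and anything else that speaks Arrow) can read the same file back without any Beam-specific coder.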