>
> What is the driving force for transport compression? Are you seeing that
>> as a major bottleneck in particular circumstances? (I'm not disagreeing,
>> just want to clearly define the particular problem you're worried about.)
>
>
> I've been working on a 20% project where we appear to be IO bound for
> transporting record batches.  Also, I believe Ji Liu (tianchen92) has been
> seeing some of the same bottlenecks with the query engine they are
> working on.  Trading off some CPU here would allow us to lower the overall
> latency in the system.
>

That's quite interesting. Can you share more about the use case? With the
exception of broadcast and round-robin type distribution patterns, we find
that more cycles are typically spent partitioning the data before sending,
so being IO bound is less of a problem. In most of our operations, the
largest workloads are partitioned this way, so it isn't typically a
bottleneck. (We also have clients with 10gbps and 100gbps network
interconnects...) Are you partitioning the data pre-send?
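For concreteness, the CPU-for-IO trade-off under discussion can be sketched as follows. This is a hypothetical illustration only, with stdlib zlib standing in for whatever codec Arrow might adopt, and a synthetic byte string standing in for a serialized record batch:

```python
# Illustrative only: zlib stands in for a real transport codec, and the
# payload stands in for a serialized record batch. Repetitive columnar
# data compresses well, which is the case motivating transport compression.
import zlib

payload = b"".join(i.to_bytes(8, "little") for i in range(10_000)) * 4

compressed = zlib.compress(payload, level=1)  # low level = cheap CPU
assert zlib.decompress(compressed) == payload
print(f"wire bytes: {len(payload)} -> {len(compressed)} "
      f"({len(compressed) / len(payload):.1%})")
```

The point is only that fewer bytes cross the wire in exchange for compression CPU on the sender; whether that wins depends on the link speed and how CPU-bound the sender already is.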



> Random thought: what do you think of defining this at the transport level
>> rather than the record batch level? (e.g. in Arrow Flight). This is one way
>> to avoid extending the core record batch concept with something that isn't
>> related to processing (at least in your initial proposal)
>
>
> Per above, this seems like a reasonable approach to me if we want to hold
> off on buffer level compression.  Another use-case for buffer/record-batch
> level compression would be the Feather file format, for decompressing only
> a subset of columns/rows.  If this use-case isn't compelling, I'd be happy to
> hold off adding compression to sparse batches until we have benchmarks
> showing the trade-off between channel level and buffer level compression.
>

I was proposing that type-specific buffer encodings be done at the Flight
level, not message-level encodings. Just want to make sure the formats
don't leak into the core spec until we're ready.
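As an aside, the column-subset read pattern mentioned in the quoted Feather use-case can be sketched in a few lines. This is a hypothetical toy, not the Feather format or Arrow's API: it only shows why compressing each column buffer independently lets a reader skip decompression work for columns it never touches:

```python
# Hypothetical sketch (not Feather itself): each column buffer is
# compressed independently, so a reader pays decompression CPU only
# for the columns it actually requests.
import zlib

columns = {
    "a": bytes(range(256)) * 100,
    "b": b"\x00" * 25_600,
    "c": b"repeated-pattern" * 1_600,
}

# "Write": compress every column buffer on its own.
stored = {name: zlib.compress(buf) for name, buf in columns.items()}

# "Read": decompress only the requested subset of columns.
def read_columns(stored, names):
    return {name: zlib.decompress(stored[name]) for name in names}

subset = read_columns(stored, ["a", "c"])
assert subset["a"] == columns["a"]
assert "b" not in subset  # column "b" was never decompressed
```

With whole-message (channel-level) compression, by contrast, the entire payload has to be decompressed before any one column can be read, which is the trade-off the benchmarks would need to capture.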
