That's right, but is there a filesystem that allows unbounded file
sizes? If there will always be an upper size limit, does that mean that
you cannot use the order of elements in the file as is? You might need
to carry the offset over from one file to the next (that's how Kafka
does it), but that implies that you don't use what the batch storage
natively gives you - you store the offset yourself (as metadata).
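
Roughly something like this, just as a sketch (plain Java, all names
made up, not any real API): the writer keeps one global counter that it
persists as metadata, so the sequence survives file rolls.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only - all names are made up, this is not any real API.
public class RollingWriter {

  private static final long MAX_RECORDS_PER_FILE = 1_000_000;

  private final Path dir;
  private long nextOffset;      // global counter, independent of the current file
  private long recordsInFile;
  private Path currentFile;

  public RollingWriter(Path dir) throws IOException {
    this.dir = dir;
    Path meta = dir.resolve("offset.meta");
    // Recover the last persisted offset, so a restart continues the sequence.
    this.nextOffset =
        Files.exists(meta) ? Long.parseLong(Files.readString(meta).trim()) : 0L;
    roll();
  }

  public void append(String record) throws IOException {
    if (recordsInFile >= MAX_RECORDS_PER_FILE) {
      roll();                   // new file, but the offset counter keeps growing
    }
    Files.writeString(
        currentFile,
        nextOffset + "\t" + record + "\n",
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    nextOffset++;
    recordsInFile++;
    // The ordering lives in this metadata, not in the files themselves.
    Files.writeString(dir.resolve("offset.meta"), Long.toString(nextOffset));
  }

  private void roll() {
    // File name carries the first offset it will contain - a convention, nothing more.
    currentFile = dir.resolve("part-" + nextOffset + ".log");
    recordsInFile = 0;
  }
}
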
Either way, maybe the discussion is not that important, because the
invariant requirement persists - there has to be a sequential observer
of the data that creates a sequence of updates in the order the data
was observed and persists this order. If you have two observers of the
data, each storing its own (even unbounded) file, then (if partitioning
by key is not enforced) I'd say the ordering cannot be used.
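
To make the two-observer case concrete, a tiny sketch (again made-up
names, not a real API): each observer numbers the data it sees on its
own, and the numbers say nothing about global order unless each key
always goes to the same observer.

import java.util.ArrayList;
import java.util.List;

// Sketch only, made-up names: each observer numbers the data it sees on its own.
class Observed {
  final String observerId;
  final long seq;               // monotone per observer, meaningless across observers
  final String key;
  final String value;

  Observed(String observerId, long seq, String key, String value) {
    this.observerId = observerId;
    this.seq = seq;
    this.key = key;
    this.value = value;
  }
}

class Observer {
  private final String id;
  private long nextSeq = 0;

  Observer(String id) { this.id = id; }

  Observed observe(String key, String value) {
    return new Observed(id, nextSeq++, key, value);
  }
}

public class TwoObservers {
  public static void main(String[] args) {
    Observer a = new Observer("A");
    Observer b = new Observer("B");
    List<Observed> merged = new ArrayList<>();
    merged.add(a.observe("user1", "login"));
    merged.add(b.observe("user1", "logout"));  // seq 0 again, from the other observer
    // Comparing the two seq values says nothing about which event came first;
    // it only would if "user1" were always routed to a single observer.
  }
}
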
This mechanism seems to me to be related to what limits parallelism in
streaming sources and to why batch sources are generally easier to
parallelise.
Jan
On 5/30/19 1:35 PM, Reuven Lax wrote:
Files can grow (depending on the filesystem), and tailing growing
files is a valid use case.
On Wed, May 29, 2019 at 3:23 PM Jan Lukavský <[email protected]> wrote:
> Offsets within a file, unordered between files seems exactly
> analogous with offsets within a partition, unordered between
> partitions, right?
Not exactly. The key difference is that partitions in streaming stores
are defined (on purpose, and with key impact on this discussion) as
unbounded sequences of appends. Files, on the other hand, are always of
finite size. This difference is what makes the semantics of offsets in
a partitioned stream useful, because they are guaranteed to only
increase. On batch stores such as files, these offsets would have to
start from zero again after some (finite) time, which makes them
useless for comparison.
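
As a small sketch of that asymmetry (made-up record types, nothing
real): comparing by offset is meaningful within one partition, while
comparing by position within a file is not, because the position
restarts at zero with every new file and the files themselves have no
defined order.

import java.util.Comparator;

// Made-up record types, only to illustrate the asymmetry described above.
record StreamRecord(int partition, long offset) {}
record FileRecord(String file, long positionInFile) {}

class OffsetSemantics {
  // Meaningful within a single partition: offsets are assigned once and only increase.
  static final Comparator<StreamRecord> WITHIN_PARTITION =
      Comparator.comparingLong(StreamRecord::offset);

  // Not meaningful across files: positionInFile restarts at zero in every new file,
  // and the files themselves carry no defined order relative to each other.
  static final Comparator<FileRecord> WITHIN_FILE_ONLY =
      Comparator.comparingLong(FileRecord::positionInFile);
}
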
On 5/29/19 2:44 PM, Robert Bradshaw wrote:
> On Tue, May 28, 2019 at 12:18 PM Jan Lukavský <[email protected]> wrote:
>> As I understood it, Kenn was supporting the idea that sequence
>> metadata is preferable over FIFO. I was trying to point out that it
>> should even provide the same functionality as FIFO, plus one more
>> important property - reproducibility and the ability to be persisted
>> and reused the same way in batch and streaming.
>>
>> There is no doubt that sequence metadata can be stored in every
>> storage. But regarding some implicit ordering that sources might
>> have - yes, of course, data written into HDFS or Cloud Storage has
>> an ordering, but only a partial one - inside some bulk (e.g. a file)
>> - and the ordering is not well defined at the boundaries of these
>> bulks (between files). That is why I'd say that ordering of sources
>> is relevant only for (partitioned!) streaming sources and generally
>> always reduces to sequence metadata (e.g. offsets).
> Offsets within a file, unordered between files seems exactly
> analogous with offsets within a partition, unordered between
> partitions, right?
>
>> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
>>> Huge +1 to all Kenn said.
>>>
>>> Jan, batch sources can have orderings too, just like Kafka. I
>>> think it's reasonable (for both batch and streaming) that if a
>>> source has an ordering that is an important part of the data, it
>>> should preserve this ordering into the data itself (e.g. as
>>> sequence numbers, offsets, etc.)
>>>
>>> On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles <[email protected]> wrote:
>>>> I strongly prefer explicit sequence metadata over FIFO
>>>> requirements, because:
>>>>
>>>>   - FIFO is complex to specify: for example Dataflow has "per
>>>>     stage key-to-key" FIFO today, but it is not guaranteed to
>>>>     remain so (plus "stage" is not a portable concept, nor even
>>>>     guaranteed to remain a Dataflow concept)
>>>>   - complex specifications are by definition poor usability (if
>>>>     necessary, then it is what it is)
>>>>   - overly restricts the runner and reduces parallelism; for
>>>>     example any non-stateful ParDo has per-element parallelism,
>>>>     not per "key"
>>>>   - another perspective on that: FIFO makes everyone pay rather
>>>>     than just the transform that requires exact sequencing
>>>>   - previous implementation details like reshuffles become part
>>>>     of the model
>>>>   - I'm not even convinced the use cases involved are addressed
>>>>     by some careful FIFO restrictions; many sinks re-key and they
>>>>     would all have to become aware of how keying of a sequence of
>>>>     "stages" affects the end-to-end FIFO
>>>>
>>>> A noop becoming a non-noop is essentially the mathematical
>>>> definition of moving from a higher-level to a lower-level
>>>> abstraction.
>>>>
>>>> So this strikes at the core question of what level of abstraction
>>>> Beam aims to represent. Lower-level means there are fewer possible
>>>> implementations and it is more tied to the underlying architecture,
>>>> and anything that is not a near-exact match pays a huge penalty.
>>>> Higher-level means there are more implementations possible with
>>>> different tradeoffs, though they may all pay a minor penalty.
>>>>
>>>> I could be convinced to change my mind, but it needs some
>>>> extensive design, examples, etc. I think it is probably about the
>>>> most consequential design decision in the whole Beam model, around
>>>> the same level as the decision to use ParDo and GBK as the
>>>> primitives IMO.
>>>>
>>>> Kenn