GitHub user dhalperi opened a pull request:
https://github.com/apache/incubator-beam/pull/353
Add BoundedReader APIs for expressing remaining and consumed parallelism
These are useful for dynamic work rebalancing and autoscaling.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dhalperi/incubator-beam limited-parallelism
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-beam/pull/353.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #353
----
commit dfeecdbd6a751f0bac1f398dd1a86040a5c5166e
Author: Dan Halperin <[email protected]>
Date: 2016-05-04T00:53:48Z
BoundedReader: add getParallelism{Consumed,Remaining}
And implement it for common sources
commit 837a42dcce1d14a45ff7c8b7c3b4efdbbf98ef82
Author: Dan Halperin <[email protected]>
Date: 2016-05-15T20:54:57Z
OffsetBasedReader: test limited parallelism signals
commit 32894f8e4f7b68b940e71dcc52202bf2216914aa
Author: Dan Halperin <[email protected]>
Date: 2016-05-16T22:33:11Z
CompressedSource: add tests of parallelism and progress
*) empty file
*) non-empty compressed file
*) non-empty not-compressed file
commit ca29728dbff52a796279753c5b3efc3659f9ba06
Author: Dan Halperin <[email protected]>
Date: 2016-05-17T03:04:52Z
TextIO: implement and test parallelism
*) empty file
*) non-empty file
commit b866541f71df59f278e17b2895e7b412cbc7734f
Author: Dan Halperin <[email protected]>
Date: 2016-05-17T03:48:33Z
CountingSource: test limited parallelism
commit 1a4a0d99049871fe58f5194c59e0bf646894fae7
Author: Dan Halperin <[email protected]>
Date: 2016-05-17T05:33:07Z
AvroSource: rewrite to support remaining parallelism
*) Make the start of a block match Avro's definition: the first byte after
the previous sync marker.
This enables detecting the last block in the file.
*) This change enables us to unify currentOffset and currentBlockOffset, as
all records are emitted
at the start of the block that contains them.
*) Simplify block header reading to have fewer object allocations and
buffers using a direct
reader and a (allocated once only) CountingInputStream to measure the
size of that header.
*) Add tests for consumed and remaining parallelism
*) Let BlockBasedSource detect the end of the file in remaining parallelism.
*) Verify in more places that the correct number of bytes is read from
the input Avro file.
commit 4c775be82ccf68cdc221242e9cdfd4d0796a13e7
Author: Dan Halperin <[email protected]>
Date: 2016-05-17T08:47:54Z
CompressedSource: implement currentOffset based on bytes decompressed
This is not a very good offset because it is an upper bound, but it is
likely better than not reporting any progress at all.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---