GitHub user jkff opened a pull request:
https://github.com/apache/beam/pull/3957
Fixes TextIO and AvroIO tests of watchForNewFiles
* AvroIO: the test needs to specify a trigger so that files are really
generated continuously and the testing of watchForNewFiles is non-vacuous.
* TextIO: files were generated by manual code, so writing a file could
sometimes race with TextIO reading it. TextIO might then see the same file
with two different sizes, count it as two different files (two Metadata
objects for the same filename with different sizes are not equal), and read
the file twice.
It makes sense to address that separately, e.g. by allowing a key extractor
to be specified in the Watch transform, but that is outside the scope of this
PR and is tracked in https://issues.apache.org/jira/browse/BEAM-3030.
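
For illustration only, the double-counting described in the TextIO item can be
sketched in plain Java without any Beam dependency. Everything in this sketch
is an assumption made for the example: FileMeta is a hypothetical stand-in for
the Metadata objects mentioned above (its equality covers both name and size),
and newOutputs is a hypothetical helper that mimics "emit each previously
unseen item once". It is not the Watch transform's implementation, and the
key-extractor variant only shows the idea proposed for BEAM-3030, not an
existing API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

class WatchDedupSketch {
    /** Hypothetical stand-in for file metadata: equality covers name AND size. */
    static final class FileMeta {
        final String name;
        final long sizeBytes;

        FileMeta(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FileMeta)) {
                return false;
            }
            FileMeta other = (FileMeta) o;
            return name.equals(other.name) && sizeBytes == other.sizeBytes;
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, sizeBytes);
        }
    }

    /** Returns the observations treated as new, deduplicated by the extracted key. */
    static <K> List<FileMeta> newOutputs(List<FileMeta> polls, Function<FileMeta, K> keyFn) {
        Set<K> seen = new HashSet<>();
        List<FileMeta> fresh = new ArrayList<>();
        for (FileMeta m : polls) {
            // Set.add returns true only the first time a key is observed.
            if (seen.add(keyFn.apply(m))) {
                fresh.add(m);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        // The same file observed twice while it is still being written.
        List<FileMeta> polls = Arrays.asList(
            new FileMeta("out-0.txt", 10L),
            new FileMeta("out-0.txt", 25L));

        // Deduplicating on the whole object: the two sizes look like two files.
        System.out.println(newOutputs(polls, Function.identity()).size()); // prints 2

        // Deduplicating on a key (the filename): the file is counted once.
        System.out.println(newOutputs(polls, m -> m.name).size()); // prints 1
    }
}
```

Note that with a name-only key the first observation wins, so a reader could
still pick up a partially written file. This PR avoids the race in the test
itself and leaves the key-extractor design to the separate issue.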
R: @reuvenlax
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkff/incubator-beam read-watch-test
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/3957.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3957
----
commit 59b450d82917707a0802c60cf910c998215cbca4
Author: Eugene Kirpichov <[email protected]>
Date: 2017-10-06T20:29:10Z
Fixes TextIO and AvroIO tests of watchForNewFiles
* AvroIO: the test needs to specify a trigger so that files are really
generated continuously and the testing of watchForNewFiles is non-vacuous.
* TextIO: files were generated by manual code, so writing a file could
sometimes race with TextIO reading it. TextIO might then see the same file
with two different sizes, count it as two different files (two Metadata
objects for the same filename with different sizes are not equal), and read
the file twice.
It makes sense to address that separately, e.g. by allowing a key extractor
to be specified in the Watch transform, but that is outside the scope of this
PR.
----