turb opened a new pull request, #38064: URL: https://github.com/apache/beam/pull/38064
`SnowflakeIO` reads happen in several steps:

1. Run a `COPY` that writes partitioned gzipped CSV files into a directory
2. List the directory
3. Read the files
4. Parse the CSV

While steps 1 and 2 are done by a single worker, steps 3 and 4 can be parallelized. Google Dataflow appears able to do this (via work stealing?), but Apache Flink (with `--useDataStreamForBatch=true`) propagates the parallelism of steps 1 and 2 to steps 3 and 4, leading to very long processing times for work that could be fully parallel.

This change replaces the simple `DoFn` with a `SnowflakeBoundedSource` that executes the `COPY` and then reads the resulting splits (see the first sketch below). Doing so exposed a bug: a race between the creation of those splits and their reading. It is fixed by a change in `LazyFlinkSourceSplitEnumerator` that makes subtasks wait until the splits are ready (second sketch below). I tested that this still works on Google Dataflow.
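For illustration, here is a minimal sketch of the `BoundedSource` shape this relies on, not the actual PR code: the `COPY` runs once inside `split()`, and each staged file becomes its own sub-source, so the runner can schedule one reader per file. `runCopyIntoStage()`, `listStagedFiles()`, and the `String` element type are placeholders for the real JDBC calls and parsed CSV rows.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

// Illustrative sketch only: run the COPY once at split time, then hand one
// staged file to each sub-source so downstream reads can parallelize.
class SnowflakeBoundedSource extends BoundedSource<String> {
  private final String stagingDirectory; // assumed: the directory the COPY writes to
  private final String fileToRead;       // null for the root source, set for splits

  SnowflakeBoundedSource(String stagingDirectory, String fileToRead) {
    this.stagingDirectory = stagingDirectory;
    this.fileToRead = fileToRead;
  }

  @Override
  public List<SnowflakeBoundedSource> split(
      long desiredBundleSizeBytes, PipelineOptions options) throws IOException {
    // Hypothetical helpers standing in for the JDBC COPY statement and the
    // listing of the staged files.
    runCopyIntoStage(stagingDirectory);
    List<SnowflakeBoundedSource> splits = new ArrayList<>();
    for (String file : listStagedFiles(stagingDirectory)) {
      splits.add(new SnowflakeBoundedSource(stagingDirectory, file));
    }
    return splits; // one split per gzipped CSV file => one reader per file
  }

  @Override
  public long getEstimatedSizeBytes(PipelineOptions options) {
    return 0; // unknown before the COPY has run
  }

  @Override
  public BoundedReader<String> createReader(PipelineOptions options) {
    // Would decompress and parse `fileToRead` row by row; omitted in this sketch.
    throw new UnsupportedOperationException("reader omitted in this sketch");
  }

  @Override
  public Coder<String> getOutputCoder() {
    return StringUtf8Coder.of();
  }

  private static void runCopyIntoStage(String dir) { /* COPY INTO ... */ }

  private static List<String> listStagedFiles(String dir) {
    return new ArrayList<>(); // LIST of staged files
  }
}
```

Reading through `Read.from(source)` instead of a `DoFn` is what lets the runner see the per-file splits and distribute them across workers.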
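And a sketch of the enumerator-side fix, again illustrative rather than the patch itself: because `LazyFlinkSourceSplitEnumerator` enumerates splits lazily, a subtask's split request can arrive before any split exists; answering `signalNoMoreSplits()` immediately is the race, and the fix is to park the request and replay it once splits are available. The `LazySplitQueue` helper and its method names are invented for this example; `SplitEnumeratorContext#assignSplit` and `#signalNoMoreSplits` are the real Flink APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;

// Illustrative helper: park split requests that arrive before the lazy
// enumeration has produced splits, instead of answering signalNoMoreSplits().
class LazySplitQueue<SplitT extends SourceSplit> {
  private final Deque<SplitT> pendingSplits = new ArrayDeque<>();
  private final Deque<Integer> waitingSubtasks = new ArrayDeque<>();
  private boolean enumerationDone = false;

  /** Delegated to from {@code SplitEnumerator#handleSplitRequest}. */
  void handleSplitRequest(int subtaskId, SplitEnumeratorContext<SplitT> context) {
    if (!pendingSplits.isEmpty()) {
      context.assignSplit(pendingSplits.poll(), subtaskId);
    } else if (enumerationDone) {
      context.signalNoMoreSplits(subtaskId); // genuinely exhausted
    } else {
      waitingSubtasks.add(subtaskId); // too early: remember the request
    }
  }

  /** Called once the lazy enumeration has materialized its splits. */
  void onSplitsDiscovered(
      List<SplitT> splits, boolean done, SplitEnumeratorContext<SplitT> context) {
    pendingSplits.addAll(splits);
    enumerationDone = done;
    // Replay parked requests now that there is something to hand out.
    while (!waitingSubtasks.isEmpty()
        && (!pendingSplits.isEmpty() || enumerationDone)) {
      handleSplitRequest(waitingSubtasks.poll(), context);
    }
  }
}
```

Parking the request rather than refusing it means a subtask simply stays idle until the enumeration catches up, so no reader is told prematurely that the source is empty.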
