turb opened a new pull request, #38064:
URL: https://github.com/apache/beam/pull/38064

   `SnowflakeIO` has several steps:
   1. Run a `COPY` that outputs partitioned gzipped CSV files in a directory
   2. List the directory
   3. Read the files
   4. Parse CSV
   
   While 1. and 2. are done by one worker, 3. and 4. can be parallelized.
   
   It appears that Google Dataflow is able to do that (using work stealing?), 
but Apache Flink (with `--useDataStreamForBatch=true`) propagates the scale of 
1. / 2. to 3. 4., leading to very long processing when it can be fully scalable.
   
   This change creates a `SnowflakeBoundedSource` instead of a simple `DoFn` to 
execute the `COPY` and then read the splits.
   
   When doing that, a bug appears: a race between the apparition of those 
splits and the reading. It is solved by a change in 
`LazyFlinkSourceSplitEnumerator` to make subtasks wait for the splits to be 
ready.
   
   I tested it still works on Google Dataflow.
   
   ------------------------
   
   GitHub Actions Tests Status (on master branch)
   
------------------------------------------------------------------------------------------------
   [![Build python source distribution and 
wheels](https://github.com/apache/beam/actions/workflows/build_wheels.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python 
tests](https://github.com/apache/beam/actions/workflows/python_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java 
tests](https://github.com/apache/beam/actions/workflows/java_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Go 
tests](https://github.com/apache/beam/actions/workflows/go_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more 
information about GitHub Actions CI or the [workflows 
README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) 
to see a list of phrases to trigger workflows.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to