Modifying pip install behavior / custom pypi index

2020-09-09 Thread Chad Dombrova
Hi all,
We are running into problems trying to use our own pypi mirror with Beam.
For those who are not well versed in the esotera of python package
management, pip provides a few ways to specify urls for the pypi index
server:

   - command line
   [1]:
   via --index-url
   - environment variables
   [2]:
   via PIP_INDEX_URL. In Beam, we don’t have any way to influence the
   environment of the boot process that runs pip install.
   - pip.conf [3]:
   we could provide this as an artifact, but we don’t have any way of placing
   it in the correct location (e.g. /etc/pip.conf) on the instance that
   runs pip install.
   - requirements.txt files can specify certain pip install flags
   
[4],
   such as --index-url. As such, passing a requirements file via
   --requirements_file would theoretically work, but we also want to be
   able to provide dev packages as wheels via --extra_package, which would
   be installed independently from the requirements file and thus use the
   default pypi index. We may be able to upload our wheel as an artifact and
   refer to it using a local path in the requirements file, but this solution
   seems a bit brittle as the local artifacts path is different for each job.

Are there any known solutions to this problem? Here are some ideas:

   - add support for providing a pip.conf as a known artifact type (akin to
   --requirements_file).  this is by far the most powerful and
   straightforward solution, but do we have the stomach for yet another cli
   option?
   - add support for providing a destination path for artifacts, which
   would let us install it into /etc/pip.conf. I can see strong
   safety/security concerns around this.
   - provide a guarantee that the working directory for the boot process is
   inside the artifact directory: then we could refer to wheels inside our
   requirements file using relative paths.

We're happy to make a pull request to add support for this feature, but
it'd be great to have some input on the ideal solution before we begin.

thanks!
-chad

[1] https://pip.pypa.io/en/stable/reference/pip_install/#install-index-url
[2] https://pip.pypa.io/en/stable/user_guide/#environment-variables
[3] https://pip.pypa.io/en/stable/user_guide/#config-file
[4]
https://pip.pypa.io/en/stable/reference/pip_install/#requirements-file-format

-chad


DRAFT - Beam board report September 2020

2020-09-09 Thread Kenneth Knowles
Hi all,

The Beam board report is technically due today. Sorry for the late notice.
I will leave it open for about 24 hours for people in every time zone to
hopefully have a chance to contribute.

Please add suggestions to the doc:

https://docs.google.com/document/d/1Z2pKu6NYdpYAka9IfTkEiAMOVubH1jjkkRQMaVJbs5Y/edit?usp=sharing

You can read past reports at
https://whimsy.apache.org/board/minutes/Beam.html to get a feel for it.
Here are some specific examples of things that are good to add:

 - highlights from CHANGES.md
 - interesting technical discussions that steer the project
 - major integrations with other projects
 - community events
 - major user facing addition/deprecation

It is OK to add very rough data and not be too careful with language. I
will edit the wording before I submit it. It helps me if you use
https://s.apache.org to create short links for things, as the board report
tool is a bit strict about formatting and does not like long links.

Kenn


[Proposal] Website Revamp project

2020-09-09 Thread Gris Cuevas
Hi Beam community, 

In a previous thread [1] I mentioned that I was going to work on product 
requirements document (PRD) for a project to address some of the requests and 
ideas collected by Aizhamal, Rose and David in previous efforts. 

The PRD is ready [2], and I'd like to get your feedback before moving forward 
into implementation. Please add you comments by Sunday Septermber 13, 2020. 

We're looking to start work on this project in around 2 weeks. 

Thank you! 
Gris

[1] 
https://lists.apache.org/thread.html/r1a4cea1e8b53bef73c49f75df13956024d8d78bc81b36e54ef71f8a9%40%3Cdev.beam.apache.org%3E

[2] https://s.apache.org/beam-site-revamp


Re: [PROPOSAL] Preparing for Beam 2.25.0 release

2020-09-09 Thread Rui Wang
Thanks Robin for working on this!


-Rui

On Wed, Sep 9, 2020 at 10:11 AM Robin Qiu  wrote:

> Hello everyone,
>
> The next Beam release (2.25.0) is scheduled to be cut on September 23
> according to the release calendar [1].
>
> I'd like to volunteer myself to handle this release. I plan on cutting the
> branch on that date and cherry-picking in release-blocking fixes
> afterwards. So unresolved release blocking JIRA issues should have
> their "Fix Version/s" marked as "2.25.0".
>
> Any comments or objections?
>
> Thanks,
> Robin Qiu
>
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>


[PROPOSAL] Preparing for Beam 2.25.0 release

2020-09-09 Thread Robin Qiu
Hello everyone,

The next Beam release (2.25.0) is scheduled to be cut on September 23
according to the release calendar [1].

I'd like to volunteer myself to handle this release. I plan on cutting the
branch on that date and cherry-picking in release-blocking fixes
afterwards. So unresolved release blocking JIRA issues should have
their "Fix Version/s" marked as "2.25.0".

Any comments or objections?

Thanks,
Robin Qiu

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com


Re: PubSub MultiReader

2020-09-09 Thread Iñigo San Jose
I made the modification to fuse topics and subscriptions, I now think it
may be cleaner. Attached a file with both versions.

Let me know what you think :D

Thanks

On Wed, Sep 9, 2020 at 5:35 PM Iñigo San Jose  wrote:

>
> Hi everyone!
>
> I have seen a very common use case in Beam which is pipelines that read
> from multiple PubSub topics or multiple subscriptions to end up flattening
> them. In general, this makes the pipeline harder to understand since not
> much context can be taken from it.
>
> I was thinking of adding a PTransform that reads from a list of
> Topics/Subscriptions with a simple Extend to the actual ReadFromPubSub
> Transform. This approach would help both developing and debugging the
> pipelines:
>
>- No time spent developing a multi reader
>- Easier organization of Topics/Subcriptions
>- Pipeline graph easier to the eye and less convoluted
>- Faster debugging
>- Avoid issues when pipelines are too wide and hide other parts of the
>pipeline
>
> As mentioned, I only have the PTransform based on an Extend, but in the
> future implementing this with Splittable DoFn would be the way to go. The
> PTransform takes 3 optional parameters:
>
>- topics: list of topics
>- subscriptions: list of subscriptions
>- with_context: boolean. If True it adds the topic/subscription name
>to the message, so it becomes a tuple of (topic/subs name, message). This
>could be helpful for future aggregations. The name may need to change.
>
> The parameters `topics` and `subscriptions` may be fused in a single
> parameter and use the path [1] to know if it's a topic or a subscription.
> But I consider it cleaner this way.
>
> Please find attached the current class I made as well as some screenshots
> of how the pipeline looks.
>
> Since I don't know much about SplittableDoFns yet, I was considering
> making a Pull Request for this PTransform and, on the meanwhile, work on a
> SplittableDoFn version.
>
> Thanks a lot for your time, let me know what you think
> Iñigo
>
> [1] projects//subscriptions/
> projects//topics/
>


-- 

*•** Iñigo San Jose | josein...@google.coom*
*•** Big Data *Technical Solutions Engineer
*• *Google Cloud Platform
• *Google, Dublin*
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam import Flatten, Map, PTransform


class PubSubMultipleReader(PTransform):
def __init__(self, topics=[], subscriptions=[], with_context=False):
self.topics = topics
self.subscriptions = subscriptions
self.with_context = with_context

def expand(self, pcol):
topics_subscriptions_pcol = []
for topic in self.topics:
topic_split = topic.split('/')
topic_project = topic_split[1]
topic_name = topic_split[-1]
current_topic = (
pcol | f'PubSub Topics/project:{topic_project}/Read {topic_name}' >> ReadFromPubSub(
topic=topic)
)
if self.with_context:
name = f'PubSub Topics/project:{topic_project}/Add Keys {topic_name}'
current_topic = current_topic | name >> Map(
lambda x: (topic, x))

topics_subscriptions_pcol.append(current_topic)

for subscription in self.subscriptions:
subscription_split = subscription.split('/')
subscription_project = subscription_split[1]
subscription_name = subscription_split[-1]
current_subscription = (
pcol | f'PubSub Subscriptions/project:{subscription_project}/Read {subscription_name}' >> ReadFromPubSub(
subscription=subscription)
)
if self.with_context:
name = f"PubSub Subscriptions/project:{subscription_project}/Add Keys {subscription_name}"
current_subscription = current_subscription | name >> Map(
lambda x: (subscription, x))

topics_subscriptions_pcol.append(current_subscription)

return tuple(topics_subscriptions_pcol) | Flatten()


class PubSubMultipleReaderV2(PTransform):
def __init__(self, source_list=[], with_context=False):
self.source_list = source_list
self.with_context = with_context

def expand(self, pcol):
sources_pcol = []
for source in self.source_list:
source_split = source.split('/')
source_project = source_split[1]
source_type = source_split[2]
source_name = source_split[-1]

step_name_base = f'PubSub {source_type}/project:{source_project}'

if source_type == 'topics':
current_source = (
pcol | f'{step_name_base}/Read {source_name}' >> ReadFromPubSub(
topic=source)
)
if source_type == 'subscriptions':
current_source = (
pcol | f'{step_name_base}/Read {source_name}' >> ReadFromPubSub(

PubSub MultiReader

2020-09-09 Thread Iñigo San Jose
Hi everyone!

I have seen a very common use case in Beam which is pipelines that read
from multiple PubSub topics or multiple subscriptions to end up flattening
them. In general, this makes the pipeline harder to understand since not
much context can be taken from it.

I was thinking of adding a PTransform that reads from a list of
Topics/Subscriptions with a simple Extend to the actual ReadFromPubSub
Transform. This approach would help both developing and debugging the
pipelines:

   - No time spent developing a multi reader
   - Easier organization of Topics/Subcriptions
   - Pipeline graph easier to the eye and less convoluted
   - Faster debugging
   - Avoid issues when pipelines are too wide and hide other parts of the
   pipeline

As mentioned, I only have the PTransform based on an Extend, but in the
future implementing this with Splittable DoFn would be the way to go. The
PTransform takes 3 optional parameters:

   - topics: list of topics
   - subscriptions: list of subscriptions
   - with_context: boolean. If True it adds the topic/subscription name to
   the message, so it becomes a tuple of (topic/subs name, message). This
   could be helpful for future aggregations. The name may need to change.

The parameters `topics` and `subscriptions` may be fused in a single
parameter and use the path [1] to know if it's a topic or a subscription.
But I consider it cleaner this way.

Please find attached the current class I made as well as some screenshots
of how the pipeline looks.

Since I don't know much about SplittableDoFns yet, I was considering making
a Pull Request for this PTransform and, on the meanwhile, work on a
SplittableDoFn version.

Thanks a lot for your time, let me know what you think
Iñigo

[1] projects//subscriptions/
projects//topics/
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam import Flatten, Map, PTransform

class PubSubMultipleReader(PTransform):
def __init__(self, topics=[], subscriptions=[], with_context=False):
self.topics = topics
self.subscriptions = subscriptions
self.with_context = with_context

def expand(self, pcol):
topics_subscriptions_pcol = []
for topic in self.topics:
topic_split = topic.split('/')
topic_project = topic_split[1]
topic_name = topic_split[-1]
current_topic = (
pcol | f'PubSub Topics/project:{topic_project}/Read {topic_name}' >> ReadFromPubSub(
topic=topic)
)
if self.with_context:
name = f'PubSub Topics/project:{topic_project}/Add Keys {topic_name}'
current_topic = current_topic | name >> Map(
lambda x: (topic, x))

topics_subscriptions_pcol.append(current_topic)

for subscription in self.subscriptions:
subscription_split = subscription.split('/')
subscription_project = subscription_split[1]
subscription_name = subscription_split[-1]
current_subscription = (
pcol | f'PubSub Subscriptions/project:{subscription_project}/Read {subscription_name}' >> ReadFromPubSub(
subscription=subscription)
)
if self.with_context:
name = f"PubSub Subscriptions/project:{subscription_project}/Add Keys {subscription_name}"
current_subscription = current_subscription | name >> Map(
lambda x: (subscription, x))

topics_subscriptions_pcol.append(current_subscription)

return tuple(topics_subscriptions_pcol) | Flatten()

Re: [ANNOUNCE] New committer: Heejong Lee

2020-09-09 Thread Alexey Romanenko
Congratulations! 

> On 9 Sep 2020, at 09:50, Jan Lukavský  wrote:
> 
> Congratulations!
> On 9/9/20 1:00 AM, Chamikara Jayalath wrote:
>> Congrats Heejong!
>> 
>> On Tue, Sep 8, 2020 at 1:55 PM Yichi Zhang > > wrote:
>> Congratulations Heejong!
>> 
>> On Tue, Sep 8, 2020 at 1:42 PM Ankur Goenka > > wrote:
>> Congratulations Heejong!
>> 
>> On Tue, Sep 8, 2020 at 1:40 PM Udi Meiri > > wrote:
>> Congrats Heejong!
>> 
>> On Tue, Sep 8, 2020 at 12:33 PM Tyson Hamilton > > wrote:
>> Congratulations!
>> 
>> On Tue, Sep 8, 2020, 12:10 PM Robert Bradshaw > > wrote:
>> Congratulations, Heejong!
>> 
>> On Tue, Sep 8, 2020 at 11:41 AM Rui Wang > > wrote:
>> Congrats, Heejong!
>> 
>> 
>> -Rui
>> 
>> On Tue, Sep 8, 2020 at 11:26 AM Robin Qiu > > wrote:
>> Congrats, Heejong!
>> 
>> On Tue, Sep 8, 2020 at 11:23 AM Valentyn Tymofieiev > > wrote:
>> Congratulations, Heejong!
>> 
>> On Tue, Sep 8, 2020 at 11:14 AM Ahmet Altay > > wrote:
>> Hi everyone,
>> 
>> Please join me and the rest of the Beam PMC in welcoming a new committer: 
>> Heejong Lee mailto:heej...@apache.org>>.
>> 
>> Heejong has been active in the community for more than 2 years, worked on 
>> various IOs (parquet, kafka, file, pubsub) and most recently worked on 
>> adding cross language transforms feature to Beam [1].
>> 
>> In consideration of his contributions, the Beam PMC trusts him with the 
>> responsibilities of a Beam committer [2].
>> 
>> Thank you for your contributions Heejong!
>> 
>> -Ahmet, on behalf of the Apache Beam PMC
>> 
>> [1] 
>> https://issues.apache.org/jira/browse/BEAM-10634?jql=project%20%3D%20BEAM%20AND%20assignee%20in%20(heejong)%20ORDER%20BY%20resolved%20DESC%2C%20affectedVersion%20ASC%2C%20priority%20DESC%2C%20updated%20DESC
>>  
>> 
>> [2] 
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>  
>> 



Re: Enabling checkpointing while running Flink Runner

2020-09-09 Thread Kyle Weaver
> But, from the configuration, there is no way to pass the checkpoint
interval.

Set the checkpointingInterval pipeline option.

https://beam.apache.org/documentation/runners/flink/

On Wed, Sep 9, 2020 at 4:44 AM Sruthi Sree Kumar 
wrote:

> Hi,
>
> How do we enable checkpointing for Flink Runner? To enable checkpointing,
> we set the checkpoint interval But, from the configuration, there is no way
> to pass the checkpoint interval.
>
> We are trying to enable checkpointing while running NEXMark queries as
> part of our experiments.
>
> --
> Regards,
>
> Sruthi
>
>


Enabling checkpointing while running Flink Runner

2020-09-09 Thread Sruthi Sree Kumar
Hi,

How do we enable checkpointing for Flink Runner? To enable checkpointing,
we set the checkpoint interval But, from the configuration, there is no way
to pass the checkpoint interval.

We are trying to enable checkpointing while running NEXMark queries as part
of our experiments.

-- 
Regards,

Sruthi


Re: [ANNOUNCE] New committer: Heejong Lee

2020-09-09 Thread Jan Lukavský

Congratulations!

On 9/9/20 1:00 AM, Chamikara Jayalath wrote:

Congrats Heejong!

On Tue, Sep 8, 2020 at 1:55 PM Yichi Zhang > wrote:


Congratulations Heejong!

On Tue, Sep 8, 2020 at 1:42 PM Ankur Goenka mailto:goe...@google.com>> wrote:

Congratulations Heejong!

On Tue, Sep 8, 2020 at 1:40 PM Udi Meiri mailto:eh...@google.com>> wrote:

Congrats Heejong!

On Tue, Sep 8, 2020 at 12:33 PM Tyson Hamilton
mailto:tyso...@google.com>> wrote:

Congratulations!

On Tue, Sep 8, 2020, 12:10 PM Robert Bradshaw
mailto:rober...@google.com>> wrote:

Congratulations, Heejong!

On Tue, Sep 8, 2020 at 11:41 AM Rui Wang
mailto:ruw...@google.com>> wrote:

Congrats, Heejong!


-Rui

On Tue, Sep 8, 2020 at 11:26 AM Robin Qiu
mailto:robi...@google.com>> wrote:

Congrats, Heejong!

On Tue, Sep 8, 2020 at 11:23 AM Valentyn
Tymofieiev mailto:valen...@google.com>> wrote:

Congratulations, Heejong!

On Tue, Sep 8, 2020 at 11:14 AM Ahmet
Altay mailto:al...@google.com>> wrote:

Hi everyone,

Please join me and the rest of the
Beam PMC in welcoming
a new committer: Heejong Lee
mailto:heej...@apache.org>>.

Heejong has been active in the
community for more than 2 years,
worked on various IOs (parquet,
kafka, file, pubsub) and most
recently worked on adding cross
language transforms feature to
Beam [1].

In consideration of his
contributions, the Beam PMC trusts
him with the responsibilities of a
Beam committer [2].

Thank you for your contributions
Heejong!

-Ahmet, on behalf of the Apache
Beam PMC

[1]

https://issues.apache.org/jira/browse/BEAM-10634?jql=project%20%3D%20BEAM%20AND%20assignee%20in%20(heejong)%20ORDER%20BY%20resolved%20DESC%2C%20affectedVersion%20ASC%2C%20priority%20DESC%2C%20updated%20DESC
[2]

https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer