Re: [Discuss] Beam Summit 2020 Dates & locations

2019-11-10 Thread Reza Rokni
Hi,

I think doing something in Asia later in the year would be very cool, I
think given there has not yet been a lot of formal activity there, we
should give plenty of time to ensure good local sponsorship etc...

Cheers

Reza

On Fri, 8 Nov 2019 at 19:10, jincheng sun wrote:

> +1 for extending the discussion to the user mailing list.
>
> Maximilian Michels wrote on Fri, 8 Nov 2019 at 6:32 PM:
>
>> The dates sound good to me. I agree that the bay area has an advantage
>> because of its large tech community. On the other hand, it is a question
>> of how we run the event. For Berlin we managed to get about 200
>> attendees, but for the Beam Summit in Las Vegas with ApacheCon the
>> attendance was much lower.
>>
>> Should this also be discussed on the user mailing list?
>>
>> Cheers,
>> Max
>>
>> On 07.11.19 22:50, Alex Van Boxel wrote:
>> > Date-wise, I'm wondering why we would switch the Europe and NA
>> > summits; this would mean that Berlin and the new EU summit would be
>> > almost 1.5 years apart.
>> >
>> >   _/
>> > _/ Alex Van Boxel
>> >
>> >
>> > On Thu, Nov 7, 2019 at 8:43 PM Ahmet Altay wrote:
>> >
>> > I prefer the bay area for the NA summit. My reasoning is that there is a
>> > critical mass of contributors and users in that location, probably
>> > more than alternative NA locations. I was not involved with planning
>> > recently and I do not know if there were people who could not attend
>> > due to location previously. If that is the case, I agree with Elliotte
>> > on looking for other options.
>> >
>> > Related to dates: the March (Asia) and mid-May (NA) dates are a bit
>> > close. Mid-June for NA might be better to spread out the events. The
>> > other pieces look good.
>> >
>> > Ahmet
>> >
>> > On Thu, Nov 7, 2019 at 7:09 AM Elliotte Rusty Harold
>> > <elh...@ibiblio.org> wrote:
>> >
>> > The U.S. sadly is not a reliable destination for international
>> > conferences these days. Almost every conference I go to, big and
>> > small, has at least one speaker, sometimes more, who can't get into
>> > the country. Canada seems worth considering. Vancouver, Montreal, and
>> > Toronto are all convenient.
>> >
>> > On Wed, Nov 6, 2019 at 2:17 PM Griselda Cuevas wrote:
>> >  >
>> >  > Hi Beam Community!
>> >  >
>> >  > I'd like to kick off a thread to discuss potential dates and
>> > venues for the 2020 Beam Summits.
>> >  >
>> >  > I did some research on industry conferences happening in 2020
>> > and pre-selected a few ranges as follows:
>> >  >
>> >  > (2 days) NA between mid-May and mid-June
>> >  > (2 days) EU mid October
>> >  > (1 day) Asia Mini Summit:  March
>> >  >
>> >  > I'd like to hear your thoughts on these dates and get
>> > consensus on exact dates as the convo progresses.
>> >  >
>> >  > For locations these are the options I reviewed:
>> >  >
>> >  > NA: Austin Texas, Berkeley California, Mexico City.
>> >  > Europe: Warsaw, Barcelona, Paris
>> >  > Asia: Singapore
>> >  >
>> >  > Let the discussion begin!
>> >  > G (on behalf of the Beam Summit Steering Committee)
>> >  >
>> >  >
>> >  >
>> >
>> >
>> > --
>> > Elliotte Rusty Harold
>> > elh...@ibiblio.org 
>> >
>>
>



Re: Questions about the current and future design of the job service message stream

2019-11-10 Thread Chad Dombrova
Hi,

>> You can see that each JobMessagesResponse may contain a message *or* a
>> GetJobStateResponse.
>>
>> What’s the intention behind this design?
>>
> I believe this was because a user may want to listen to both job state and
> messages all in one stream.
>

Just to be crystal clear, what's the advantage of using a single stream
versus two?
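
For reference, consuming the combined stream today means dispatching on the
oneof in each response; with two streams a client would subscribe to each
separately. A minimal Python sketch of the single-stream case (the message
and field names are assumed to roughly match beam_job_api.proto, so treat
them as illustrative rather than authoritative):

    # Sketch only: JobMessagesRequest, message_response, and state_response
    # are assumed from beam_job_api.proto and may differ slightly.
    from apache_beam.portability.api import beam_job_api_pb2

    def consume_message_stream(stub, job_id):
        request = beam_job_api_pb2.JobMessagesRequest(job_id=job_id)
        for response in stub.GetMessageStream(request):
            kind = response.WhichOneof("response")
            if kind == "message_response":
                print("message:", response.message_response.message_text)
            elif kind == "state_response":
                print("state change:", response.state_response.state)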

>> The reason this is important to me is I’d like to make a handful of
>> changes to GetMessageStream to make it more powerful:
>>
>>- propagate messages from user code (if they opt in to setting up
>>their logger appropriately). currently, AFAICT, the only message the
>>message stream delivers is a final error, if the job fails (other than
>>state changes). It was clearly the original intent of this endpoint to
>>carry other types of messages, and I'd like to bring that to fruition.
>>
> Log messages are a lot of data; we have users writing GB/s when
> aggregated across all their machines in Google Cloud, so I'm not sure this
> will scale without a lot of control over filtering. Users sometimes don't
> recognize how much they are logging, and if you have 1,000 VMs each writing
> only a few lines at a time you can easily saturate this stream.
>

Yes, we're concerned about message volume as well.  The plan would be to
add filters, which could be propagated from the job server to the logger on
the runner and SDK (if they support it) to avoid over-saturating the
stream.  For example, the log level right now is basically ERROR, so we'd
propagate that to the runner and it would only send error messages back to
the job server.  Thus, we should hopefully be able to roll out this feature
without much change for the end user.  They could then opt in to higher
volume message levels, if desired.

Some possible filters (sketched as a small data structure below) could be:

   - job id (required)
   - log level (default=ERROR)
   - transform id(s) (optional; defaults to just runner messages)
   - a jsonpath selector for filtering on message metadata?
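
As a rough illustration of the shape such a filter could take (a sketch
only; the class and field names below are hypothetical, not an existing
Beam API):

    # Hypothetical filter object for the proposed message-stream filtering;
    # nothing here exists in Beam today.
    import logging
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class MessageStreamFilter:
        job_id: str                              # required
        log_level: int = logging.ERROR           # default=ERROR
        # empty list = runner messages only
        transform_ids: List[str] = field(default_factory=list)
        # e.g. a jsonpath expression over message metadata
        metadata_selector: Optional[str] = None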

I think a logging implementation would consist of 2 parts:  the logging
service (i.e. an implementation of GetMessageStream) and the logging
handler for emitting messages from the runner and optionally user
transforms.  Handlers would need to be implemented for each SDK (i.e.
language).

The default logging implementation would consist of the InMemoryJobService
on the servicer side, which would send the filter object to the handler.
 The handler would pre-filter messages and stream them back to the standard
job service, which would simply forward on everything it receives, as it
does now.
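
A handler doing that pre-filtering might look something like the sketch
below (it assumes the hypothetical MessageStreamFilter above and a send_fn
callback; neither is an existing Beam class):

    # Hypothetical pre-filtering log handler on the runner/SDK side.
    import logging

    class PreFilteringHandler(logging.Handler):
        """Drops records below the filter's level before they hit the stream."""

        def __init__(self, stream_filter, send_fn):
            super().__init__(level=stream_filter.log_level)
            self.stream_filter = stream_filter
            self.send_fn = send_fn  # forwards a formatted record to the job service

        def emit(self, record):
            # Only forward records for the transforms the filter asks for (if any).
            transform_id = getattr(record, "transform_id", None)
            if self.stream_filter.transform_ids and (
                    transform_id not in self.stream_filter.transform_ids):
                return
            self.send_fn(self.format(record))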

A StackDriver logging service would be a bit different.  Its logging
handler might send *everything* to StackDriver so that there's a complete
record that can be sifted through later.  Its servicer component would
interpret the filter object into a StackDriver filter string and create a
subscription with StackDriver.

In this way we could support both semi-persistent logging services with a
queryable history (like StackDriver) and completely transient message
streams like we have now.
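
For the StackDriver case, interpreting the filter object would roughly
amount to building a Cloud Logging filter string, along these lines (the
label names here are made up for illustration; only the severity syntax is
standard):

    # Rough sketch of translating the hypothetical MessageStreamFilter into
    # a StackDriver (Cloud Logging) filter string.
    import logging

    _SEVERITIES = {logging.DEBUG: "DEBUG", logging.INFO: "INFO",
                   logging.WARNING: "WARNING", logging.ERROR: "ERROR"}

    def to_stackdriver_filter(f):
        clauses = ['labels.job_id="%s"' % f.job_id,
                   'severity>=%s' % _SEVERITIES.get(f.log_level, "ERROR")]
        if f.transform_ids:
            ors = " OR ".join(
                'labels.transform_id="%s"' % t for t in f.transform_ids)
            clauses.append("(%s)" % ors)
        return " AND ".join(clauses)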

>
>>- make it possible to back GetMessageStream with logging services
>>like StackDriver, CloudWatch, or Elasticsearch
>>
> That is interesting. Originally the message stream was designed around
> system messages from the runner and not specifically around user log
> messages, due to volume concerns. All logging integration, to my knowledge,
> has been deferred to the client libraries for those specific services.
>

What we're after is a user experience akin to what the Dataflow UI
provides: view a pipeline, open the log console, and view recent messages
from the runner.  Click on a transform to view messages emitted by that
transform.  We've found Flink's logging and log UI to be sorely lacking, and
we like the idea of tackling this problem at the Beam level, especially
considering so much of what we want is already there in some form.

Another use case that I think would benefit from this is providing custom
progress messages to users who launch a batch job from a shell, since the
message stream is already emitted there.  Of course, you'd have to be
careful about message volume, but as I mentioned there would be two levels
where you'd need to opt in:

   - changing log level from its default (ERROR)
   - setting up transform-level logging


-chad


Re: Cython unit test suites running without Cythonized sources

2019-11-10 Thread Chad Dombrova
Hi all,


> The sdist step creates a package that should be installed into each
> tox environment. If the tox environment has cython when this apache
> beam package is installed, it should be used. Nose (or whatever)
> should then run the tests.
>
I spent some time this weekend trying to understand the Beam python build
process, and here’s an overview of what I’ve learned:

   - the :sdks:python:sdist gradle task creates the source tarball (no
   surprises there)
  - the protobuf stubs are generated during this process
   - the sdist is provided to tox, which installs it into the
   virtualenv for that task
   - for *-cython tasks, tox installs the cython dep and, as Ahmet
   asserted, python setup.py nosetests performs the cythonization.
  - this cythonized build overrides the one installed by tox

Here’s what I learned about the current status of tests wrt cython:

   - cython tox tasks *are* using cython (good!)
   - non-cython tox tasks *are not* using cython (good!)
   - none of the GCP or integration tests are using cython (bad?)
  - This is because the build is only cythonized when python setup.py
  nosetests is used in conjunction with tox (tox installs cython, python
  setup.py nosetests compiles it).
  - GCP tests don't install cython.  ITs don't use tox.

To confirm my understanding of this, I created a PR [1] to assert that a
cythonized or pure-python build is being used.  A cythonized build is
expected by default on linux systems unless a special flag is provided to
inform the test otherwise.  It appears as though the portable tests passed
(i.e. used cython), but I forgot to add the assertion for those; like the
other ITs they are not using cython.
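
For reference, a check of that kind boils down to testing whether the
compiled modules are importable. A minimal sketch (the module chosen here
illustrates the idea, and is not necessarily what the PR asserts on):

    # apache_beam.coders.stream is a compiled (Cython) module; the
    # pure-Python fallback is apache_beam.coders.slow_stream, so a failed
    # import is a reasonable signal that the build is not cythonized.
    def is_cythonized():
        try:
            from apache_beam.coders import stream  # noqa: F401
            return True
        except ImportError:
            return False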

*Questions:*

   - Is the lack of cython for ITs expected and/or desired?
   - Why aren't ITs using tox?  It's quite possible to pass arguments into
   tox to control its behavior.  For example, it seems reasonable that
   run_integration_test.sh could be run inside tox


*Next Steps:* There has been some movement in the Python community to solve
problems around build dependencies [2] and toolchains [3].  I hope to have
a proposal for how to simplify this process soon.

[1] https://github.com/apache/beam/pull/10058
[2] https://www.python.org/dev/peps/pep-0517/
[3] https://www.python.org/dev/peps/pep-0518/

-chad