Re: Root logger configuration

2019-12-13 Thread Ahmet Altay
On Fri, Dec 13, 2019 at 1:08 PM Robert Bradshaw  wrote:

> The default behavior of Python logging is
>
> (1) Logging handlers may get added (usually in main or very close to it).
> (2) Logging goes to those handlers, or a default handler if none were
> added.
>
> With this proposal, we would have
>
> (0) A default handler gets added.
> (1) Logging handlers may get added (usually in main or very close to
> it) in addition to the default.
> (2) Logging goes to these handlers and the default.
>
> On Fri, Dec 13, 2019 at 12:46 PM Pablo Estrada  wrote:
> >
> > I looked at the documentation for basicConfig:
> https://docs.python.org/3/library/logging.html#logging.basicConfig
> >
> > Specifically, the following line:
> >
> > > This function does nothing if the root logger already has handlers
> configured, unless the keyword argument force is set to True.
> >
> > That would mean that anyone can override the handling later on - which
> the workers do?
> > Best
> > -P.
> >
> > On Fri, Dec 13, 2019 at 10:55 AM Robert Bradshaw 
> wrote:
> >>
> >> Thanks for looking into this.
> >>
> >> I'm not sure unconditionally calling logging.basicConfig() on module
> >> import is the correct solution--this prevents modules that wish to set
> >> up handlers in place of the default handler from being able to do so.
> >> (This is why logging.basicConfig is lazily done at the first logging
> >> statement for the root logger, rather than earlier.)
> >>
> >> On Thu, Dec 12, 2019 at 4:34 PM Pablo Estrada 
> wrote:
> >> >
> >> > Hello all,
> >> > It has been pointed out to me by Chad, and also by others, that my
> logging changes have caused logs to start getting lost.
>

Does this include losing user logs? Is there a JIRA, and should we fix it
in the Beam 2.18 branch?


> >> >
> >> > It seems that by never logging on the root logger, initialization for
> a root handler is skipped; and that's what causes the failures.
> >> >
> >> > I will work on a fix for this. I am thinking of providing a very
> simple apache_beam.utils.get_logger function that does something like this:
> >> >
> >> > def get_logger(name):
> >> >   logging.basicConfig()
> >> >   return logging.getLogger(name)
> >> >
> >> > And specific paths that need special handling of the logs should
> override this config by adding their own handlers (e.g. sdk_worker,
> fn_api_runner, etc).
> >> >
> >> > I hope I can have a fix for this by tomorrow.
> >> > Best
> >> > -P.
>


Re: Beam's job crashes on cluster

2019-12-13 Thread Matthew K.
This is not that difficult to implement, but it would be better done once you have integrated native job submission for Python.

 

However, I need to fix this last issue, which is the crash. Any idea where I should look?

 
 

Sent: Friday, December 13, 2019 at 2:52 PM
From: "Kyle Weaver" 
To: dev 
Subject: Re: Beam's job crashes on cluster



> I applied some modifications to the code to run Beam tasks on k8s cluster using spark-submit.

 

Interesting, how does that work?
 


On Fri, Dec 13, 2019 at 12:49 PM Matthew K.  wrote:




 


I'm not sure if that could be a problem. I'm *not* running standalone Spark. I applied some modifications to the code to run Beam tasks on a k8s cluster using spark-submit. Therefore, worker nodes are spawned when spark-submit is called and connect to the master, and they are supposed to be destroyed when the job is finished.

 

Therefore, the crash must have some other cause.

 

Sent: Friday, December 13, 2019 at 2:37 PM
From: "Kyle Weaver" 
To: dev 
Subject: Re: Beam's job crashes on cluster


Correction: it should be formatted `spark://host:port`, following the rules here: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
 


On Fri, Dec 13, 2019 at 12:36 PM Kyle Weaver  wrote:


You probably will want to add argument `-PsparkMasterUrl=localhost:8080` (or whatever host:port your Spark master is on) to the job-server:runShadow command.
 

Without specifying the master URL, the default is to start an embedded Spark master within the same JVM as the job server, rather than using your standalone master.

 


On Fri, Dec 13, 2019 at 12:15 PM Matthew K.  wrote:




The job server is running on the master node with this command:

 


./gradlew :runners:spark:job-server:runShadow --gradle-user-home `pwd`

 

Spark workers (executors) run on separate nodes, sharing /tmp (1GB size) in order to be able to access Beam job's MANIFEST. I'm running Python 2.7.

 

There are no other shared resources between them. A pure Spark job works fine on the cluster (as far as I have tested a simple one). If I'm not mistaken, the Beam job executes with no problem when the master and workers all run on the same node (but in separate containers).

 

Sent: Friday, December 13, 2019 at 1:49 PM
From: "Kyle Weaver" 
To: dev@beam.apache.org
Subject: Re: Beam's job crashes on cluster




> Do workers need to talk to job server independent from spark executors?

 

No, they don't.

 

From the time stamps in your logs, it looks like the sigbus happened after the executor was lost.

 

Some additional info that might help us establish a chain of causation:

- the arguments you used to start the job server?

- the spark cluster deployment setup?

 

On Fri, Dec 13, 2019 at 8:00 AM Matthew K.  wrote:









Actually, the reason for that error is that the Job Server/JRE crashes at the final stages and the service becomes unavailable (note: the job is on a very small dataset that, in the absence of a cluster, would be done in a couple of seconds):

 


19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 43
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 295
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 4
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on sparkpi-1576249172021-driver-svc.xyz.svc:7079 in memory (size: 14.4 KB, free: 967.8 MB)
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 192.168.102.238:46463 in memory (size: 14.4 KB, free: 3.3 GB)
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 192.168.78.233:35881 in memory (size: 14.4 KB, free: 3.3 GB)
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 222
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 294
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 37
<-> 98% EXECUTING [2m 26s]
> IDLE
> IDLE
> IDLE
> :runners:spark:job-server:runShadow
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7f5ad7cd0d5e, pid=825, tid=0x7f5abb886700
#
# JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8f8d5e]  PerfLongVariant::sample()+0x1e
#
# Core dump written. Default location: /opt/spark/beam/core or core.825
#
# An error report file with more information is saved as:
# /opt/spark/beam/hs_err_pid825.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

 

 

From /opt/spark/beam/hs_err_pid825.log:

 

Internal exceptions (10 events):  
Event: 0.664 Thread 0x7f5ad000a800 Exception  

Re: Root logger configuration

2019-12-13 Thread Robert Bradshaw
The default behavior of Python logging is

(1) Logging handlers may get added (usually in main or very close to it).
(2) Logging goes to those handlers, or a default handler if none were added.

With this proposal, we would have

(0) A default handler gets added.
(1) Logging handlers may get added (usually in main or very close to
it) in addition to the default.
(2) Logging goes to these handlers and the default.
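
A minimal sketch of the two setups above, using plain stdlib logging with made-up handler and logger names (not Beam code):

import logging
import sys

def reset_root():
    # Clear root handlers so each scenario starts from a clean slate.
    root = logging.getLogger()
    for handler in list(root.handlers):
        root.removeHandler(handler)

# Scenario 1 -- default behavior today: nothing is configured until "main" adds
# a handler, so a record is emitted only by the handler(s) main chose.
reset_root()
main_handler = logging.StreamHandler(sys.stderr)
main_handler.setFormatter(logging.Formatter("[main] %(levelname)s %(message)s"))
logging.getLogger().addHandler(main_handler)        # step (1)
logging.getLogger("beam.example").warning("hello")  # step (2): printed once

# Scenario 2 -- the proposal: a default handler is installed eagerly (as if at
# import time), and main's handler is added in addition to it, so the same
# record is emitted twice.
reset_root()
logging.basicConfig()                               # step (0): default handler
logging.getLogger().addHandler(main_handler)        # step (1)
logging.getLogger("beam.example").warning("hello")  # step (2): printed twice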

On Fri, Dec 13, 2019 at 12:46 PM Pablo Estrada  wrote:
>
> I looked at the documentation for basicConfig: 
> https://docs.python.org/3/library/logging.html#logging.basicConfig
>
> Specifically, the following line:
>
> > This function does nothing if the root logger already has handlers 
> > configured, unless the keyword argument force is set to True.
>
> That would mean that anyone can override the handling later on - which the 
> workers do?
> Best
> -P.
>
> On Fri, Dec 13, 2019 at 10:55 AM Robert Bradshaw  wrote:
>>
>> Thanks for looking into this.
>>
>> I'm not sure unconditionally calling logging.basicConfig() on module
>> import is the correct solution--this prevents modules that wish to set
>> up handlers in place of the default handler from being able to do so.
>> (This is why logging.basicConfig is lazily done at the first logging
>> statement for the root logger, rather than earlier.)
>>
>> On Thu, Dec 12, 2019 at 4:34 PM Pablo Estrada  wrote:
>> >
>> > Hello all,
>> > It has been pointed out to me by Chad, and also by others, that my logging 
>> > changes have caused logs to start getting lost.
>> >
>> > It seems that by never logging on the root logger, initialization for a 
>> > root handler is skipped; and that's what causes the failures.
>> >
>> > I will work on a fix for this. I am thinking of providing a very simple 
>> > apache_beam.utils.get_logger function that does something like this:
>> >
>> > def get_logger(name):
>> >   logging.basicConfig()
>> >   return logging.getLogger(name)
>> >
>> > And specific paths that need special handling of the logs should override 
>> > this config by adding their own handlers (e.g. sdk_worker, fn_api_runner, 
>> > etc).
>> >
>> > I hope I can have a fix for this by tomorrow.
>> > Best
>> > -P.


Re: Beam's job crashes on cluster

2019-12-13 Thread Kyle Weaver
> I applied some modifications to the code to run Beam tasks on k8s cluster
using spark-submit.

Interesting, how does that work?

On Fri, Dec 13, 2019 at 12:49 PM Matthew K.  wrote:

>
> I'm not sure if that could be a problem. I'm *not* running standalone
> Spark. I applied some modifications to the code to run Beam tasks on k8s
> cluster using spark-submit. Therefore, worker nodes are spawned when
> spark-submit is called and connect to the master, and are supposed to be
> destroyed when job is finished.
>
> Therefore, the crash should have some other reason.
>
> *Sent:* Friday, December 13, 2019 at 2:37 PM
> *From:* "Kyle Weaver" 
> *To:* dev 
> *Subject:* Re: Beam's job crashes on cluster
> Correction: should be formatted `spark://host:port`. Should follow the
> rules here:
> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
>
> On Fri, Dec 13, 2019 at 12:36 PM Kyle Weaver  wrote:
>
>> You probably will want to add argument `-PsparkMasterUrl=localhost:8080`
>> (or whatever host:port your Spark master is on) to the job-server:runShadow
>> command.
>>
>> Without specifying the master URL, the default is to start an embedded
>> Spark master within the same JVM as the job server, rather than using your
>> standalone master.
>>
>> On Fri, Dec 13, 2019 at 12:15 PM Matthew K.  wrote:
>>
>>> Job server is running on master node by this:
>>>
>>> ./gradlew :runners:spark:job-server:runShadow --gradle-user-home `pwd`
>>>
>>> Spark workers (executors) run on separate nodes, sharing /tmp (1GB size)
>>> in order to be able to access Beam job's MANIFEST. I'm running Python 2.7.
>>>
>>> There is no other shared resources between them. A pure Spark job works
>>> fine on the cluster (as far as I tested a simple one). If I'm not wrong,
>>> beam job executes with no problem when all master and workers run on the
>>> same node (but separate containers).
>>>
>>> *Sent:* Friday, December 13, 2019 at 1:49 PM
>>> *From:* "Kyle Weaver" 
>>> *To:* dev@beam.apache.org
>>> *Subject:* Re: Beam's job crashes on cluster
>>> > Do workers need to talk to job server independent from spark executors?
>>>
>>> No, they don't.
>>>
>>> From the time stamps in your logs, it looks like the sigbus happened
>>> after the executor was lost.
>>>
>>> Some additional info that might help us establish a chain of causation:
>>> - the arguments you used to start the job server?
>>> - the spark cluster deployment setup?
>>>
>>> On Fri, Dec 13, 2019 at 8:00 AM Matthew K.  wrote:
>>>
 Actually, the reason for that error is that the Job Server/JRE crashes at the
 final stages and the service becomes unavailable (note: the job is on a very
 small dataset that, in the absence of a cluster, would be done in a couple of
 seconds):

 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 43
 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 295
 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 4
 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
 sparkpi-1576249172021-driver-svc.xyz.svc:7079 in memory (size: 14.4 KB,
 free: 967.8 MB)
 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
 192.168.102.238:46463 in memory (size: 14.4 KB, free: 3.3 GB)
 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
 192.168.78.233:35881 in memory (size: 14.4 KB, free: 3.3 GB)
 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 222
 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 294
 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 37
 <-> 98% EXECUTING [2m 26s]
 > IDLE
 > IDLE
 > IDLE
 > :runners:spark:job-server:runShadow
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGBUS (0x7) at pc=0x7f5ad7cd0d5e, pid=825,
 tid=0x7f5abb886700
 #
 # JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build
 1.8.0_232-b09)
 # Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64
 compressed oops)
 # Problematic frame:
 # V  [libjvm.so+0x8f8d5e]  PerfLongVariant::sample()+0x1e
 #
 # Core dump written. Default location: /opt/spark/beam/core or core.825
 #
 # An error report file with more information is saved as:
 # /opt/spark/beam/hs_err_pid825.log
 #
 # If you would like to submit a bug report, please visit:
 #   http://bugreport.java.com/bugreport/crash.jsp
 #
 Aborted (core dumped)


 From /opt/spark/beam/hs_err_pid825.log:

 Internal exceptions (10
 events):

 Event: 0.664 Thread 0x7f5ad000a800 Exception >>> 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d72040) thrown at
 [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
 605]

 Event: 0.664 Thread 0x7f5ad000a800 Exception >>> 'java/lang/ArrayIndexOutOfBoundsException'> 

Re: Beam's job crashes on cluster

2019-12-13 Thread Matthew K.
 


I'm not sure if that could be a problem. I'm *not* running standalone Spark. I applied some modifications to the code to run Beam tasks on a k8s cluster using spark-submit. Therefore, worker nodes are spawned when spark-submit is called and connect to the master, and they are supposed to be destroyed when the job is finished.

 

Therefore, the crash must have some other cause.

 

Sent: Friday, December 13, 2019 at 2:37 PM
From: "Kyle Weaver" 
To: dev 
Subject: Re: Beam's job crashes on cluster


Correction: it should be formatted `spark://host:port`, following the rules here: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
 


On Fri, Dec 13, 2019 at 12:36 PM Kyle Weaver  wrote:


You probably will want to add argument `-PsparkMasterUrl=localhost:8080` (or whatever host:port your Spark master is on) to the job-server:runShadow command.
 

Without specifying the master URL, the default is to start an embedded Spark master within the same JVM as the job server, rather than using your standalone master.

 


On Fri, Dec 13, 2019 at 12:15 PM Matthew K.  wrote:




The job server is running on the master node with this command:

 


./gradlew :runners:spark:job-server:runShadow --gradle-user-home `pwd`

 

Spark workers (executors) run on separate nodes, sharing /tmp (1GB size) in order to be able to access Beam job's MANIFEST. I'm running Python 2.7.

 

There are no other shared resources between them. A pure Spark job works fine on the cluster (as far as I have tested a simple one). If I'm not mistaken, the Beam job executes with no problem when the master and workers all run on the same node (but in separate containers).

 

Sent: Friday, December 13, 2019 at 1:49 PM
From: "Kyle Weaver" 
To: dev@beam.apache.org
Subject: Re: Beam's job crashes on cluster




> Do workers need to talk to job server independent from spark executors?

 

No, they don't.

 

From the time stamps in your logs, it looks like the sigbus happened after the executor was lost.

 

Some additional info that might help us establish a chain of causation:

- the arguments you used to start the job server?

- the spark cluster deployment setup?

 

On Fri, Dec 13, 2019 at 8:00 AM Matthew K.  wrote:









Actually, the reason for that error is that the Job Server/JRE crashes at the final stages and the service becomes unavailable (note: the job is on a very small dataset that, in the absence of a cluster, would be done in a couple of seconds):

 


19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 43
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 295
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 4
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on sparkpi-1576249172021-driver-svc.xyz.svc:7079 in memory (size: 14.4 KB, free: 967.8 MB)
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 192.168.102.238:46463 in memory (size: 14.4 KB, free: 3.3 GB)
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 192.168.78.233:35881 in memory (size: 14.4 KB, free: 3.3 GB)
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 222
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 294
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 37
<-> 98% EXECUTING [2m 26s]
> IDLE
> IDLE
> IDLE
> :runners:spark:job-server:runShadow
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7f5ad7cd0d5e, pid=825, tid=0x7f5abb886700
#
# JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8f8d5e]  PerfLongVariant::sample()+0x1e
#
# Core dump written. Default location: /opt/spark/beam/core or core.825
#
# An error report file with more information is saved as:
# /opt/spark/beam/hs_err_pid825.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

 

 

From /opt/spark/beam/hs_err_pid825.log:

 

Internal exceptions (10 events):  
Event: 0.664 Thread 0x7f5ad000a800 Exception  (0x000794d72040) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.664 Thread 0x7f5ad000a800 Exception  (0x000794d73e60) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.665 Thread 

Re: Root logger configuration

2019-12-13 Thread Pablo Estrada
I looked at the documentation for basicConfig:
https://docs.python.org/3/library/logging.html#logging.basicConfig

Specifically, the following line:

> This function does nothing if the root logger already has handlers
configured, unless the keyword argument force is set to True.

That would mean that anyone can override the handling later on - which the
workers do?
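
A small stdlib-only sketch of the quoted behavior; the handler name is made up, and note that the force keyword requires Python 3.8+:

import logging
import sys

worker_handler = logging.StreamHandler(sys.stderr)
worker_handler.setFormatter(logging.Formatter("[worker] %(message)s"))
logging.getLogger().addHandler(worker_handler)  # e.g. a worker configures root first

logging.basicConfig()                            # no-op: root already has a handler
print(len(logging.getLogger().handlers))         # -> 1

logging.basicConfig(force=True)                  # Python 3.8+: removes the existing
                                                 # root handlers, then reconfigures
print(len(logging.getLogger().handlers))         # -> 1, but now the default handler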
Best
-P.

On Fri, Dec 13, 2019 at 10:55 AM Robert Bradshaw 
wrote:

> Thanks for looking into this.
>
> I'm not sure unconditionally calling logging.basicConfig() on module
> import is the correct solution--this prevents modules that wish to set
> up handlers in place of the default handler from being able to do so.
> (This is why logging.basicConfig is lazily done at the first logging
> statement for the root logger, rather than earlier.)
>
> On Thu, Dec 12, 2019 at 4:34 PM Pablo Estrada  wrote:
> >
> > Hello all,
> > It has been pointed out to me by Chad, and also by others, that my
> logging changes have caused logs to start getting lost.
> >
> > It seems that by never logging on the root logger, initialization for a
> root handler is skipped; and that's what causes the failures.
> >
> > I will work on a fix for this. I am thinking of providing a very simple
> apache_beam.utils.get_logger function that does something like this:
> >
> > def get_logger(name):
> >   logging.basicConfig()
> >   return logging.getLogger(name)
> >
> > And specific paths that need special handling of the logs should
> override this config by adding their own handlers (e.g. sdk_worker,
> fn_api_runner, etc).
> >
> > I hope I can have a fix for this by tomorrow.
> > Best
> > -P.
>


Re: Beam's job crashes on cluster

2019-12-13 Thread Kyle Weaver
Correction: it should be formatted `spark://host:port`, following the
rules here:
https://spark.apache.org/docs/latest/submitting-applications.html#master-urls

On Fri, Dec 13, 2019 at 12:36 PM Kyle Weaver  wrote:

> You probably will want to add argument `-PsparkMasterUrl=localhost:8080`
> (or whatever host:port your Spark master is on) to the job-server:runShadow
> command.
>
> Without specifying the master URL, the default is to start an embedded
> Spark master within the same JVM as the job server, rather than using your
> standalone master.
>
> On Fri, Dec 13, 2019 at 12:15 PM Matthew K.  wrote:
>
>> Job server is running on master node by this:
>>
>> ./gradlew :runners:spark:job-server:runShadow --gradle-user-home `pwd`
>>
>> Spark workers (executors) run on separate nodes, sharing /tmp (1GB size)
>> in order to be able to access Beam job's MANIFEST. I'm running Python 2.7.
>>
>> There is no other shared resources between them. A pure Spark job works
>> fine on the cluster (as far as I tested a simple one). If I'm not wrong,
>> beam job executes with no problem when all master and workers run on the
>> same node (but separate containers).
>>
>> *Sent:* Friday, December 13, 2019 at 1:49 PM
>> *From:* "Kyle Weaver" 
>> *To:* dev@beam.apache.org
>> *Subject:* Re: Beam's job crashes on cluster
>> > Do workers need to talk to job server independent from spark executors?
>>
>> No, they don't.
>>
>> From the time stamps in your logs, it looks like the sigbus happened
>> after the executor was lost.
>>
>> Some additional info that might help us establish a chain of causation:
>> - the arguments you used to start the job server?
>> - the spark cluster deployment setup?
>>
>> On Fri, Dec 13, 2019 at 8:00 AM Matthew K.  wrote:
>>
>>> Actually, the reason for that error is that the Job Server/JRE crashes at
>>> the final stages and the service becomes unavailable (note: the job is on a
>>> very small dataset that, in the absence of a cluster, would be done in a
>>> couple of seconds):
>>>
>>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 43
>>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 295
>>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 4
>>> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
>>> sparkpi-1576249172021-driver-svc.xyz.svc:7079 in memory (size: 14.4 KB,
>>> free: 967.8 MB)
>>> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
>>> 192.168.102.238:46463 in memory (size: 14.4 KB, free: 3.3 GB)
>>> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
>>> 192.168.78.233:35881 in memory (size: 14.4 KB, free: 3.3 GB)
>>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 222
>>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 294
>>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 37
>>> <-> 98% EXECUTING [2m 26s]
>>> > IDLE
>>> > IDLE
>>> > IDLE
>>> > :runners:spark:job-server:runShadow
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGBUS (0x7) at pc=0x7f5ad7cd0d5e, pid=825, tid=0x7f5abb886700
>>> #
>>> # JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build
>>> 1.8.0_232-b09)
>>> # Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64
>>> compressed oops)
>>> # Problematic frame:
>>> # V  [libjvm.so+0x8f8d5e]  PerfLongVariant::sample()+0x1e
>>> #
>>> # Core dump written. Default location: /opt/spark/beam/core or core.825
>>> #
>>> # An error report file with more information is saved as:
>>> # /opt/spark/beam/hs_err_pid825.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://bugreport.java.com/bugreport/crash.jsp
>>> #
>>> Aborted (core dumped)
>>>
>>>
>>> From /opt/spark/beam/hs_err_pid825.log:
>>>
>>> Internal exceptions (10
>>> events):
>>>
>>> Event: 0.664 Thread 0x7f5ad000a800 Exception >> 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d72040) thrown at
>>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>>> 605]
>>>
>>> Event: 0.664 Thread 0x7f5ad000a800 Exception >> 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d73e60) thrown at
>>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>>> 605]
>>>
>>> Event: 0.665 Thread 0x7f5ad000a800 Exception >> 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d885d0) thrown at
>>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>>> 605]
>>>
>>> Event: 0.665 Thread 0x7f5ad000a800 Exception >> 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d8c6d8) thrown at
>>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>>> 605]
>>>
>>> Event: 0.673 Thread 0x7f5ad000a800 Exception >> 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794df7b70) thrown at
>>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>>> 605]
>>>
>>> Event: 0.674 Thread 0x7f5ad000a800 

Re: Beam's job crashes on cluster

2019-12-13 Thread Kyle Weaver
You probably will want to add argument `-PsparkMasterUrl=localhost:8080`
(or whatever host:port your Spark master is on) to the job-server:runShadow
command.

Without specifying the master URL, the default is to start an embedded
Spark master within the same JVM as the job server, rather than using your
standalone master.

On Fri, Dec 13, 2019 at 12:15 PM Matthew K.  wrote:

> Job server is running on master node by this:
>
> ./gradlew :runners:spark:job-server:runShadow --gradle-user-home `pwd`
>
> Spark workers (executors) run on separate nodes, sharing /tmp (1GB size)
> in order to be able to access Beam job's MANIFEST. I'm running Python 2.7.
>
> There is no other shared resources between them. A pure Spark job works
> fine on the cluster (as far as I tested a simple one). If I'm not wrong,
> beam job executes with no problem when all master and workers run on the
> same node (but separate containers).
>
> *Sent:* Friday, December 13, 2019 at 1:49 PM
> *From:* "Kyle Weaver" 
> *To:* dev@beam.apache.org
> *Subject:* Re: Beam's job crashes on cluster
> > Do workers need to talk to job server independent from spark executors?
>
> No, they don't.
>
> From the time stamps in your logs, it looks like the sigbus happened after
> the executor was lost.
>
> Some additional info that might help us establish a chain of causation:
> - the arguments you used to start the job server?
> - the spark cluster deployment setup?
>
> On Fri, Dec 13, 2019 at 8:00 AM Matthew K.  wrote:
>
>> Actually, the reason for that error is that the Job Server/JRE crashes at
>> the final stages and the service becomes unavailable (note: the job is on a
>> very small dataset that, in the absence of a cluster, would be done in a
>> couple of seconds):
>>
>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 43
>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 295
>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 4
>> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
>> sparkpi-1576249172021-driver-svc.xyz.svc:7079 in memory (size: 14.4 KB,
>> free: 967.8 MB)
>> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
>> 192.168.102.238:46463 in memory (size: 14.4 KB, free: 3.3 GB)
>> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
>> 192.168.78.233:35881 in memory (size: 14.4 KB, free: 3.3 GB)
>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 222
>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 294
>> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 37
>> <-> 98% EXECUTING [2m 26s]
>> > IDLE
>> > IDLE
>> > IDLE
>> > :runners:spark:job-server:runShadow
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  SIGBUS (0x7) at pc=0x7f5ad7cd0d5e, pid=825, tid=0x7f5abb886700
>> #
>> # JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build
>> 1.8.0_232-b09)
>> # Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64
>> compressed oops)
>> # Problematic frame:
>> # V  [libjvm.so+0x8f8d5e]  PerfLongVariant::sample()+0x1e
>> #
>> # Core dump written. Default location: /opt/spark/beam/core or core.825
>> #
>> # An error report file with more information is saved as:
>> # /opt/spark/beam/hs_err_pid825.log
>> #
>> # If you would like to submit a bug report, please visit:
>> #   http://bugreport.java.com/bugreport/crash.jsp
>> #
>> Aborted (core dumped)
>>
>>
>> From /opt/spark/beam/hs_err_pid825.log:
>>
>> Internal exceptions (10
>> events):
>>
>> Event: 0.664 Thread 0x7f5ad000a800 Exception > 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d72040) thrown at
>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>> 605]
>>
>> Event: 0.664 Thread 0x7f5ad000a800 Exception > 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d73e60) thrown at
>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>> 605]
>>
>> Event: 0.665 Thread 0x7f5ad000a800 Exception > 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d885d0) thrown at
>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>> 605]
>>
>> Event: 0.665 Thread 0x7f5ad000a800 Exception > 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d8c6d8) thrown at
>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>> 605]
>>
>> Event: 0.673 Thread 0x7f5ad000a800 Exception > 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794df7b70) thrown at
>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>> 605]
>>
>> Event: 0.674 Thread 0x7f5ad000a800 Exception > 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794df8f38) thrown at
>> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
>> 605]
>> Event: 0.674 Thread 0x7f5ad000a800 Exception > 'java/lang/ArrayIndexOutOfBoundsException'> (0x000794dfa5b8) thrown at
>> 

Re: Beam's job crashes on cluster

2019-12-13 Thread Matthew K.
The job server is running on the master node with this command:

 


./gradlew :runners:spark:job-server:runShadow --gradle-user-home `pwd`

 

Spark workers (executors) run on separate nodes, sharing /tmp (1GB size) in order to be able to access Beam job's MANIFEST. I'm running Python 2.7.

 

There are no other shared resources between them. A pure Spark job works fine on the cluster (as far as I have tested a simple one). If I'm not mistaken, the Beam job executes with no problem when the master and workers all run on the same node (but in separate containers).

 

Sent: Friday, December 13, 2019 at 1:49 PM
From: "Kyle Weaver" 
To: dev@beam.apache.org
Subject: Re: Beam's job crashes on cluster




> Do workers need to talk to job server independent from spark executors?

 

No, they don't.

 

From the time stamps in your logs, it looks like the sigbus happened after the executor was lost.

 

Some additional info that might help us establish a chain of causation:

- the arguments you used to start the job server?

- the spark cluster deployment setup?

 

On Fri, Dec 13, 2019 at 8:00 AM Matthew K.  wrote:









Actually, the reason for that error is that the Job Server/JRE crashes at the final stages and the service becomes unavailable (note: the job is on a very small dataset that, in the absence of a cluster, would be done in a couple of seconds):

 


19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 43
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 295
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 4
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on sparkpi-1576249172021-driver-svc.xyz.svc:7079 in memory (size: 14.4 KB, free: 967.8 MB)
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 192.168.102.238:46463 in memory (size: 14.4 KB, free: 3.3 GB)
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 192.168.78.233:35881 in memory (size: 14.4 KB, free: 3.3 GB)
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 222
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 294
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 37
<-> 98% EXECUTING [2m 26s]
> IDLE
> IDLE
> IDLE
> :runners:spark:job-server:runShadow
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7f5ad7cd0d5e, pid=825, tid=0x7f5abb886700
#
# JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8f8d5e]  PerfLongVariant::sample()+0x1e
#
# Core dump written. Default location: /opt/spark/beam/core or core.825
#
# An error report file with more information is saved as:
# /opt/spark/beam/hs_err_pid825.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

 

 

From /opt/spark/beam/hs_err_pid825.log:

 

Internal exceptions (10 events):  
Event: 0.664 Thread 0x7f5ad000a800 Exception  (0x000794d72040) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.664 Thread 0x7f5ad000a800 Exception  (0x000794d73e60) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.665 Thread 0x7f5ad000a800 Exception  (0x000794d885d0) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.665 Thread 0x7f5ad000a800 Exception  (0x000794d8c6d8) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.673 Thread 0x7f5ad000a800 Exception  (0x000794df7b70) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.674 Thread 0x7f5ad000a800 Exception  (0x000794df8f38) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605] 

Re: Beam's job crashes on cluster

2019-12-13 Thread Kyle Weaver
> Do workers need to talk to job server independent from spark executors?

No, they don't.

From the time stamps in your logs, it looks like the sigbus happened after
the executor was lost.

Some additional info that might help us establish a chain of causation:
- the arguments you used to start the job server?
- the spark cluster deployment setup?

On Fri, Dec 13, 2019 at 8:00 AM Matthew K.  wrote:

> Actually, the reason for that error is that the Job Server/JRE crashes at
> the final stages and the service becomes unavailable (note: the job is on a
> very small dataset that, in the absence of a cluster, would be done in a
> couple of seconds):
>
> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 43
> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 295
> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 4
> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
> sparkpi-1576249172021-driver-svc.xyz.svc:7079 in memory (size: 14.4 KB,
> free: 967.8 MB)
> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
> 192.168.102.238:46463 in memory (size: 14.4 KB, free: 3.3 GB)
> 19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
> 192.168.78.233:35881 in memory (size: 14.4 KB, free: 3.3 GB)
> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 222
> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 294
> 19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 37
> <-> 98% EXECUTING [2m 26s]
> > IDLE
> > IDLE
> > IDLE
> > :runners:spark:job-server:runShadow
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0x7) at pc=0x7f5ad7cd0d5e, pid=825, tid=0x7f5abb886700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build
> 1.8.0_232-b09)
> # Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x8f8d5e]  PerfLongVariant::sample()+0x1e
> #
> # Core dump written. Default location: /opt/spark/beam/core or core.825
> #
> # An error report file with more information is saved as:
> # /opt/spark/beam/hs_err_pid825.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Aborted (core dumped)
>
>
> From /opt/spark/beam/hs_err_pid825.log:
>
> Internal exceptions (10
> events):
>
> Event: 0.664 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d72040) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
>
> Event: 0.664 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d73e60) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
>
> Event: 0.665 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d885d0) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
>
> Event: 0.665 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794d8c6d8) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
>
> Event: 0.673 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794df7b70) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
>
> Event: 0.674 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794df8f38) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
> Event: 0.674 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794dfa5b8) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
>
> Event: 0.674 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794dfb6f0) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
>
> Event: 0.674 Thread 0x7f5ad000a800 Exception  'java/lang/ArrayIndexOutOfBoundsException'> (0x000794dfedf0) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line
> 605]
>
> Event: 0.695 Thread 0x7f5ad000a800 Exception  'java/lang/NoClassDefFoundError': org/slf4j/impl/StaticMarkerBinder>
> (0x000794f69e70) thrown at
> [/home/openjdk/jdk8u/hotspot/src/share/vm/classfile/systemDictionary.cpp,
> line 199]
>
>
> Looking at the logs when running the script, I can see executors become
> lost, but I'm not sure if that might be related to the crash of the job server:
>
> 19/12/13 15:07:29 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID
> 13, 192.168.118.75, executor 1, partition 0, PROCESS_LOCAL, 8055
> bytes)
> 19/12/13 15:07:29 INFO BlockManagerInfo: Added broadcast_10_piece0 in
> memory on 192.168.118.75:37327 (size: 47.3 KB, free: 3.3
> GB)
> 19/12/13 

Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread Robert Bradshaw
+1 (binding)

On Fri, Dec 13, 2019 at 10:23 AM Pablo Estrada  wrote:
>
> +1 (binding)
>
> On Fri, Dec 13, 2019 at 8:47 AM Maximilian Michels  wrote:
>>
>> +1 (binding)
>>
>> On 13.12.19 17:10, Jeff Klukas wrote:
>> > +1 (non-binding)
>> >
>> > On Thu, Dec 12, 2019 at 11:58 PM Kenneth Knowles > > > wrote:
>> >
>> > Please vote on the proposal for Beam's mascot to be the Firefly.
>> > This encompasses the Lampyridae family of insects, without
>> > specifying a genus or species.
>> >
>> > [ ] +1, Approve Firefly being the mascot
>> > [ ] -1, Disapprove Firefly being the mascot
>> >
>> > The vote will be open for at least 72 hours excluding weekends. It
>> > is adopted by at least 3 PMC +1 approval votes, with no PMC -1
>> > disapproval votes*. Non-PMC votes are still encouraged.
>> >
>> > PMC voters, please help by indicating your vote as "(binding)"
>> >
>> > Kenn
>> >
>> > *I have chosen this format for this vote, even though Beam uses
>> > simple majority as a rule, because I want any PMC member to be able
>> > to veto based on concerns about overlap or trademark.
>> >


Re: Root logger configuration

2019-12-13 Thread Robert Bradshaw
Thanks for looking into this.

I'm not sure unconditionally calling logging.basicConfig() on module
import is the correct solution--this prevents modules that wish to set
up handlers in place of the default handler from being able to do so.
(This is why logging.basicConfig is lazily done at the first logging
statement for the root logger, rather than earlier.)
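
A small sketch of the situation being described, with hypothetical handler names (plain stdlib logging, not Beam code):

import logging
import sys

# If basicConfig() has already run at import time, the root logger owns a
# default stderr handler...
logging.basicConfig()

# ...so a module (an sdk_worker-like component; the name is hypothetical) that
# wants its handler *instead of* the default has to strip the default first:
custom_handler = logging.StreamHandler(sys.stdout)  # stand-in for a real handler
root = logging.getLogger()
for handler in list(root.handlers):
    root.removeHandler(handler)   # extra step forced by the eager basicConfig
root.addHandler(custom_handler)

# With today's lazy behavior, root.addHandler(custom_handler) alone is enough:
# the implicit basicConfig triggered by the first root-level logging call only
# runs when the root logger has no handlers yet.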

On Thu, Dec 12, 2019 at 4:34 PM Pablo Estrada  wrote:
>
> Hello all,
> It has been pointed out to me by Chad, and also by others, that my logging 
> changes have caused logs to start getting lost.
>
> It seems that by never logging on the root logger, initialization for a root 
> handler is skipped; and that's what causes the failures.
>
> I will work on a fix for this. I am thinking of providing a very simple 
> apache_beam.utils.get_logger function that does something like this:
>
> def get_logger(name):
>   logging.basicConfig()
>   return logging.getLogger(name)
>
> And specific paths that need special handling of the logs should override 
> this config by adding their own handlers (e.g. sdk_worker, fn_api_runner, 
> etc).
>
> I hope I can have a fix for this by tomorrow.
> Best
> -P.


Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread Pablo Estrada
+1 (binding)

On Fri, Dec 13, 2019 at 8:47 AM Maximilian Michels  wrote:

> +1 (binding)
>
> On 13.12.19 17:10, Jeff Klukas wrote:
> > +1 (non-binding)
> >
> > On Thu, Dec 12, 2019 at 11:58 PM Kenneth Knowles  > > wrote:
> >
> > Please vote on the proposal for Beam's mascot to be the Firefly.
> > This encompasses the Lampyridae family of insects, without
> > specifying a genus or species.
> >
> > [ ] +1, Approve Firefly being the mascot
> > [ ] -1, Disapprove Firefly being the mascot
> >
> > The vote will be open for at least 72 hours excluding weekends. It
> > is adopted by at least 3 PMC +1 approval votes, with no PMC -1
> > disapproval votes*. Non-PMC votes are still encouraged.
> >
> > PMC voters, please help by indicating your vote as "(binding)"
> >
> > Kenn
> >
> > *I have chosen this format for this vote, even though Beam uses
> > simple majority as a rule, because I want any PMC member to be able
> > to veto based on concerns about overlap or trademark.
> >
>


Re: Python: pytest migration update

2019-12-13 Thread Udi Meiri
Update: 37m precommit time with the latest PR (in review).

On Tue, Dec 10, 2019 at 11:21 AM Udi Meiri  wrote:

>
>
> On Mon, Dec 9, 2019 at 9:33 PM Kenneth Knowles  wrote:
>
>>
>>
>> On Mon, Dec 9, 2019 at 6:34 PM Udi Meiri  wrote:
>>
>>> Valentyn, the speedup is due to parallelization.
>>>
>>> On Mon, Dec 9, 2019 at 6:12 PM Chad Dombrova  wrote:
>>>

 On Mon, Dec 9, 2019 at 5:36 PM Udi Meiri  wrote:

> I have given this some thought and honestly don't know if splitting into
> separate jobs will help.
> - I have seen race conditions with running setuptools in parallel, so
> more isolation is better.
>

 What race conditions have you seen?  I think if we're doing things
 right, this should not be happening, but I don't think we're doing things
 right. One thing that I've noticed is that we're building into the source
 directory, and I think we're also doing weird things like trying to
 copy the source directory beforehand.  I really think this system is
 tripping over many non-standard choices that have been made along the way.
 I have never had these sorts of problems with unit tests that use tox, even
 when many are running in parallel.  I got pulled away from it, but I'm
 really hoping to address these issues here:
 https://github.com/apache/beam/pull/10038.

>>>
>>> This comment summarizes what I believe may be the issue (setuptools races).
>>>
>>> I believe copying the source directory was done in an effort to isolate
>>> the parallel builds (setuptools, cythonize).
>>>
>>
>> Peanut gallery: containerized Jenkins builds seem like they would help,
>> and they are the current recommended best practice, but we are not there
>> yet. Agree/disagree?
>>
>
> I'm okay with containerized Jenkins builds as long as using pytest/tox
> directly still works.
>
>
>>
>> What benefits do you see from splitting up the jobs?
>

 The biggest problem is that the jobs are doing too much and take too
 long.  This simple fact compounds all of the other problems.  It seems
 pretty obvious that we need to do less in each job, as long as the sum of
 all of these smaller jobs is not substantially longer than the one
 monolithic job.

>>>
>> For some reason I keep forgetting the answer to this question: are we
>> caching pypi immutable artifacts on every Jenkins worker?
>>
>
> I don't know.
>
>
>>
>>>
 Benefits:

 - failures specific to a particular python version will be easier to
 spot in the jenkins error summary, and cheaper to re-queue.  right now the
 jenkins report mushes all of the failures together in a way that makes it
 nearly impossible to tell which python version they correspond to.  only
 the gradle scan gives you this insight, but it doesn't break the errors by
 test.

>>>
>>> I agree Jenkins handles duplicate test names pretty badly (reloading
>>> will periodically give you a different result).
>>>
>>
>> Saw this in Java too w/ ValidatesRunner suites when they ran in one
>> Jenkins job. Worthwhile to avoid.
>>
>> Kenn
>>
>>
>>> With pytest I've been able to set the suite name so that should help
>>> with identification. (I need to add pytest*.xml collection to the Jenkins
>>> job first)
>>>
>>>
 - failures common to all python versions will be reported to the user
 earlier, at which point they can cancel the other jobs if desired.  *this
 is by far the biggest benefit. * why wait for 2 hours to see the same
 failure reported for 5 versions of python?  if that had run on one version
 of python I could maybe see that error in 30 minutes (while potentially
 other python versions waited in the queue).  Repeat for each change pushed.
 - flaky jobs will be cheaper to requeue (since it will affect a
 smaller/shorter job)
 - if xdist is giving us the parallel boost we're hoping for we should
 get under the 2 hour mark every time

 Basically we're talking about getting feedback to users faster.

>>>
>>> +1
>>>
>>>

 I really don't mind pasting a few more phrases if it means faster
 feedback.

 -chad




>
> On Mon, Dec 9, 2019 at 4:17 PM Chad Dombrova 
> wrote:
>
>> After this PR goes in should we revisit breaking up the python tests
>> into separate jenkins jobs by python version?  One of the problems with
>> that plan originally was that we lost the parallelism that gradle 
>> provides
>> because we were left with only one tox task per jenkins job, and so the
>> total time to complete all python jenkins jobs went up a lot.  With
>> pytest + xdist we should hopefully be able to keep the 

Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread Maximilian Michels

+1 (binding)

On 13.12.19 17:10, Jeff Klukas wrote:

+1 (non-binding)

On Thu, Dec 12, 2019 at 11:58 PM Kenneth Knowles > wrote:


Please vote on the proposal for Beam's mascot to be the Firefly.
This encompasses the Lampyridae family of insects, without
specifying a genus or species.

[ ] +1, Approve Firefly being the mascot
[ ] -1, Disapprove Firefly being the mascot

The vote will be open for at least 72 hours excluding weekends. It
is adopted by at least 3 PMC +1 approval votes, with no PMC -1
disapproval votes*. Non-PMC votes are still encouraged.

PMC voters, please help by indicating your vote as "(binding)"

Kenn

*I have chosen this format for this vote, even though Beam uses
simple majority as a rule, because I want any PMC member to be able
to veto based on concerns about overlap or trademark.



Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread Jeff Klukas
+1 (non-binding)

On Thu, Dec 12, 2019 at 11:58 PM Kenneth Knowles  wrote:

> Please vote on the proposal for Beam's mascot to be the Firefly. This
> encompasses the Lampyridae family of insects, without specifying a genus or
> species.
>
> [ ] +1, Approve Firefly being the mascot
> [ ] -1, Disapprove Firefly being the mascot
>
> The vote will be open for at least 72 hours excluding weekends. It is
> adopted by at least 3 PMC +1 approval votes, with no PMC -1 disapproval
> votes*. Non-PMC votes are still encouraged.
>
> PMC voters, please help by indicating your vote as "(binding)"
>
> Kenn
>
> *I have chosen this format for this vote, even though Beam uses simple
> majority as a rule, because I want any PMC member to be able to veto based
> on concerns about overlap or trademark.
>


Re: Beam's job crashes on cluster

2019-12-13 Thread Matthew K.
Actually, the reason for that error is that the Job Server/JRE crashes at the final stages and the service becomes unavailable (note: the job is on a very small dataset that, in the absence of a cluster, would be done in a couple of seconds):

 


19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 43
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 295
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 4
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on sparkpi-1576249172021-driver-svc.xyz.svc:7079 in memory (size: 14.4 KB, free: 967.8 MB)
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 192.168.102.238:46463 in memory (size: 14.4 KB, free: 3.3 GB)
19/12/13 15:22:11 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 192.168.78.233:35881 in memory (size: 14.4 KB, free: 3.3 GB)
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 222
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 294
19/12/13 15:22:11 INFO ContextCleaner: Cleaned accumulator 37
<-> 98% EXECUTING [2m 26s]
> IDLE
> IDLE
> IDLE
> :runners:spark:job-server:runShadow
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7f5ad7cd0d5e, pid=825, tid=0x7f5abb886700
#
# JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8f8d5e]  PerfLongVariant::sample()+0x1e
#
# Core dump written. Default location: /opt/spark/beam/core or core.825
#
# An error report file with more information is saved as:
# /opt/spark/beam/hs_err_pid825.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

 

 

From /opt/spark/beam/hs_err_pid825.log:

 

Internal exceptions (10 events):  
Event: 0.664 Thread 0x7f5ad000a800 Exception  (0x000794d72040) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.664 Thread 0x7f5ad000a800 Exception  (0x000794d73e60) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.665 Thread 0x7f5ad000a800 Exception  (0x000794d885d0) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.665 Thread 0x7f5ad000a800 Exception  (0x000794d8c6d8) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.673 Thread 0x7f5ad000a800 Exception  (0x000794df7b70) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.674 Thread 0x7f5ad000a800 Exception  (0x000794df8f38) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]
Event: 0.674 Thread 0x7f5ad000a800 Exception  (0x000794dfa5b8) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]
Event: 0.674 Thread 0x7f5ad000a800 Exception  (0x000794dfb6f0) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.674 Thread 0x7f5ad000a800 Exception  (0x000794dfedf0) thrown at [/home/openjdk/jdk8u/hotspot/src/share/vm/runtime/sharedRuntime.cpp, line 605]   
Event: 0.695 Thread 0x7f5ad000a800 Exception  (0x000794f69e70) 

Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread Gleb Kanterov
+1 (non-binding)

On Fri, Dec 13, 2019 at 12:47 PM jincheng sun 
wrote:

> +1 (non-binding)
>
> Alex Van Boxel wrote on Fri, Dec 13, 2019 at 16:21:
>
>> +1
>>
>> On Fri, Dec 13, 2019, 05:58 Kenneth Knowles  wrote:
>>
>>> Please vote on the proposal for Beam's mascot to be the Firefly. This
>>> encompasses the Lampyridae family of insects, without specifying a genus or
>>> species.
>>>
>>> [ ] +1, Approve Firefly being the mascot
>>> [ ] -1, Disapprove Firefly being the mascot
>>>
>>> The vote will be open for at least 72 hours excluding weekends. It is
>>> adopted by at least 3 PMC +1 approval votes, with no PMC -1 disapproval
>>> votes*. Non-PMC votes are still encouraged.
>>>
>>> PMC voters, please help by indicating your vote as "(binding)"
>>>
>>> Kenn
>>>
>>> *I have chosen this format for this vote, even though Beam uses simple
>>> majority as a rule, because I want any PMC member to be able to veto based
>>> on concerns about overlap or trademark.
>>>
>> --
>
> Best,
> Jincheng
> -
> Committer & PMC Member at @ApacheFlink
> Staff Engineer at @Alibaba
> Blog: https://enjoyment.cool
> Twitter: https://twitter.com/sunjincheng121
> --
>


Re: Beam's job crashes on cluster

2019-12-13 Thread Matthew K.
Hi Kyle,

 

This is the pipeline options config (I replaced localhost with the actual job server's IP address and still receive the same error. Do workers need to talk to the job server independently of the Spark executors?):

 

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=%s:8099" % ip_address,
    "--environment_type=PROCESS",
    "--environment_config={\"command\":\"/opt/spark/beam/sdks/python/container/build/target/launcher/linux_amd64/boot\"}",
    ""
    ])
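
For readers following along, the same options restated with comments giving one reading of each flag; the IP value is a placeholder and the boot path is specific to this container image:

from apache_beam.options.pipeline_options import PipelineOptions

ip_address = "10.0.0.1"  # placeholder for the actual job server IP

options = PipelineOptions([
    # Submit through Beam's portable job API rather than a runner-specific path.
    "--runner=PortableRunner",
    # gRPC endpoint of the Spark job server started by :runners:spark:job-server:runShadow.
    "--job_endpoint=%s:8099" % ip_address,
    # Run the SDK harness as a locally spawned process on each executor (no Docker),
    # using the command given in environment_config.
    "--environment_type=PROCESS",
    "--environment_config={\"command\":\"/opt/spark/beam/sdks/python/container/build/target/launcher/linux_amd64/boot\"}",
    # The trailing empty string in the original list is harmless and can be dropped.
    ])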

 

 
 

Sent: Thursday, December 12, 2019 at 5:30 PM
From: "Kyle Weaver" 
To: dev 
Subject: Re: Beam's job crashes on cluster


Can you share the pipeline options you are using? Particularly environment_type and environment_config.
 


On Thu, Dec 12, 2019 at 2:58 PM Matthew K.  wrote:




Running Beam on a Spark cluster, it crashes and I get the following error (workers are on separate nodes; it works fine when the workers are on the same node as the runner):

 


> Task :runners:spark:job-server:runShadow FAILED
Exception in thread wait_until_finish_read:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/portable_runner.py", line 411, in read_messages
    for message in self._message_stream:
  File "/usr/local/lib/python2.7/dist-packages/grpc/_channel.py", line 395, in next
    return self._next()
  File "/usr/local/lib/python2.7/dist-packages/grpc/_channel.py", line 561, in _next
    raise self
_Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1576190515.361076583","description":"Error received from peer ipv4:127.0.0.1:8099","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Socket closed","grpc_status":14}"
>

Traceback (most recent call last):
  File "/opt/spark/work-dir/beam_script.py", line 49, in 
    stats = tfdv.generate_statistics_from_csv(data_location=DATA_LOCATION, pipeline_options=options)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_data_validation/utils/stats_gen_lib.py", line 197, in generate_statistics_from_csv
    statistics_pb2.DatasetFeatureStatisticsList)))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 427, in __exit__
    self.run().wait_until_finish()
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/portable_runner.py", line 429, in wait_until_finish
    for state_response in self._state_stream:
  File "/usr/local/lib/python2.7/dist-packages/grpc/_channel.py", line 395, in next
    return self._next()
  File "/usr/local/lib/python2.7/dist-packages/grpc/_channel.py", line 561, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1576190515.361053677","description":"Error received from peer ipv4:127.0.0.1:8099","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Socket closed","grpc_status":14}"











Re: Committed without review. Sorry.

2019-12-13 Thread Łukasz Gajowy
Thanks for being transparent about this, Pablo. :)


Best,
Łukasz

On Fri, Dec 13, 2019 at 06:32, Kenneth Knowles wrote:

> Thanks for alerting dev@.
>
> IMO this was handled perfectly by all parties.
>
> Kenn
>
> On Thu, Dec 12, 2019 at 4:09 PM Valentyn Tymofieiev 
> wrote:
>
>> The change LGTM, so you can consider it reviewed.  In general it would be
>> nice to set up alerts to catch these situations, to make sure they don't go
>> unnoticed.
>>
>> Also as a reminder - please don't commit or merge PRs into release
>> branches without a review from a release manager.
>>
>> On Thu, Dec 12, 2019 at 3:44 PM Pablo Estrada  wrote:
>>
>>> Seed job runs okay:
>>> https://builds.apache.org/job/beam_SeedJob_Standalone/3865/console
>>>
>>>
>>> On Thu, Dec 12, 2019 at 3:28 PM Pablo Estrada 
>>> wrote:
>>>
 I accidentally committed a small change to master:
 https://github.com/apache/beam/commit/6018326ffe74aac7d8c44ded296b92f8b5c0b556

 I am verifying that this works as intended for now.

 What should we do about this? Revert? Leave as is if it works fine?
 Best
 -P

>>>


Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread jincheng sun
+1 (non-binding)

Alex Van Boxel wrote on Fri, Dec 13, 2019 at 16:21:

> +1
>
> On Fri, Dec 13, 2019, 05:58 Kenneth Knowles  wrote:
>
>> Please vote on the proposal for Beam's mascot to be the Firefly. This
>> encompasses the Lampyridae family of insects, without specifying a genus or
>> species.
>>
>> [ ] +1, Approve Firefly being the mascot
>> [ ] -1, Disapprove Firefly being the mascot
>>
>> The vote will be open for at least 72 hours excluding weekends. It is
>> adopted by at least 3 PMC +1 approval votes, with no PMC -1 disapproval
>> votes*. Non-PMC votes are still encouraged.
>>
>> PMC voters, please help by indicating your vote as "(binding)"
>>
>> Kenn
>>
>> *I have chosen this format for this vote, even though Beam uses simple
>> majority as a rule, because I want any PMC member to be able to veto based
>> on concerns about overlap or trademark.
>>
> --

Best,
Jincheng
-
Committer & PMC Member at @ApacheFlink
Staff Engineer at @Alibaba
Blog: https://enjoyment.cool
Twitter: https://twitter.com/sunjincheng121
--


Testing Apache Beam with JDK 14 EA builds

2019-12-13 Thread Rory O'Donnell


Hi,

I work on OpenJDK at Oracle and try to encourage popular open source 
projects to test their releases on the latest
OpenJDK Early Access builds (i.e. JDK 14 EA, at the moment), by providing them
with regular information [0] describing
new builds, their features, and making sure that their bug reports and 
feedback land [1] in the right hands.


We don't expect projects to test every build; it's entirely up to you.
We're already collaborating with developers
of Apache Ant, Apache Maven, Apache Lucene, Apache Tomcat and other 
similar projects, and would love to be

able to add Apache Beam to our list [2].

Rgds, Rory

[0] Example e-mail: 
https://mail.openjdk.java.net/pipermail/quality-discuss/2019-December/000908.html
[1] 
https://wiki.openjdk.java.net/display/quality/Quality+Outreach+report+September+2019 


[2] https://wiki.openjdk.java.net/display/quality/Quality+Outreach

--
Rgds, Rory O'Donnell
Quality Engineering Manager
Oracle EMEA, Dublin, Ireland



Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread Alex Van Boxel
+1

On Fri, Dec 13, 2019, 05:58 Kenneth Knowles  wrote:

> Please vote on the proposal for Beam's mascot to be the Firefly. This
> encompasses the Lampyridae family of insects, without specifying a genus or
> species.
>
> [ ] +1, Approve Firefly being the mascot
> [ ] -1, Disapprove Firefly being the mascot
>
> The vote will be open for at least 72 hours excluding weekends. It is
> adopted by at least 3 PMC +1 approval votes, with no PMC -1 disapproval
> votes*. Non-PMC votes are still encouraged.
>
> PMC voters, please help by indicating your vote as "(binding)"
>
> Kenn
>
> *I have chosen this format for this vote, even though Beam uses simple
> majority as a rule, because I want any PMC member to be able to veto based
> on concerns about overlap or trademark.
>