Re: Unable to run python formatter (Are the instructions out of date?)

2020-10-28 Thread Alex Amato
Thanks Chad, this was helpful. :)

Btw, I think this helps format my PR somewhat, but some more checks are
run when I push the PR that are not covered by this tool.

My PR is running more checks under
*:sdks:python:test-suites:tox:py37:mypyPy37*

I am curious if anyone knows a good command line to try before pushing PRs
to catch these issues locally first? (I had one in the past, but I think
it's outdated).
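
My best guess, going by the target name above, is to run that same Gradle
target locally before pushing (untested, so corrections welcome):

./gradlew :sdks:python:test-suites:tox:py37:mypyPy37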



On Wed, Oct 28, 2020 at 8:41 PM Pablo Estrada  wrote:

> woah I didn't know about this tool at all Chad. It looks nice : )
> FWIW, if you feel up to it, I've given you edit access to the Beam wiki (
> https://cwiki.apache.org/confluence/display/BEAM) in case you'd like to
> add the tip.
> Thanks!
> -P.
>
> On Wed, Oct 28, 2020 at 8:09 PM Chad Dombrova  wrote:
>
>> I would like to edit it!  I have an apache account and I am a committer,
>> but IIRC I could not edit it with my normal credentials.
>>
>>
>> On Wed, Oct 28, 2020 at 8:02 PM Robert Burke  wrote:
>>
>>> (it's a wiki, so anyone who requests an account can improve it)
>>>
>>> On Wed, Oct 28, 2020, 7:45 PM Chad Dombrova  wrote:
>>>
 It’s unfortunate that those instructions don’t include pre-commit,
 which is by far the easiest way to do this.

 To set it up:

 pip install pre-commit
 pre-commit install

 Install sets up git pre-commit hooks so that it will run yapf and
 pylint on changed files every time you commit (you’ll need python3.7. I
 think it should be possible to loosen this, as this has been an annoyance
 for me)

 To skip running the check on commit add -n:

 git commit -nm "blah blah"

 Alternatively, to run the check manually on changed files (pre-commit
 install is not required to run it this way):

 pre-commit run yapf

 Or on all files:

 pre-commit run -a yapf

 More info here: https://pre-commit.com/#config-language_version

 On Wed, Oct 28, 2020 at 6:46 PM Alex Amato  wrote:

> I tried both the tox and yapf instructions on the python tips page.
> And the gradle target, which failed on PR precommit. I am wondering if
> there is something additional I need to set up?
>
> Here is the output from all three approaches I attempted. Any
> ideas how to get this working?
>
> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ git diff
> --name-only --relative bigquery_python_sdk origin/master | xargs yapf
> --in-place*
> Traceback (most recent call last):
>   File "/usr/local/google/home/ajamato/.local/bin/yapf", line 8, in
> 
> sys.exit(run_main())
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 365, in run_main
> sys.exit(main(sys.argv))
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 135, in main
> verbose=args.verbose)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 204, in FormatFiles
> in_place, print_diff, verify, quiet, verbose)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 233, in _FormatFile
> logger=logging.warning)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
> line 100, in FormatFile
> verify=verify)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
> line 147, in FormatCode
> tree = pytree_utils.ParseCodeToTree(unformatted_source)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/pytree_utils.py",
> line 127, in ParseCodeToTree
> raise e
>   File "apache_beam/metrics/execution.pxd", line 18
> cimport cython
>  ^
> SyntaxError: invalid syntax
>
> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ tox -e
> py3-yapf*
> GLOB sdist-make:
> /usr/local/google/home/ajamato/beam/sdks/python/setup.py
> py3-yapf create:
> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf
> ERROR: invocation failed (exit code 1), logfile:
> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/log/py3-yapf-0.log
> ===
> log start
> 
> RuntimeError: failed to build image pkg_resources because:
> Traceback (most recent call last):
>   File
> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py",
> line 

Re: Unable to run python formatter (Are the instructions out of date?)

2020-10-28 Thread Pablo Estrada
woah I didn't know about this tool at all Chad. It looks nice : )
FWIW, if you feel up to it, I've given you edit access to the Beam wiki (
https://cwiki.apache.org/confluence/display/BEAM) in case you'd like to add
the tip.
Thanks!
-P.

On Wed, Oct 28, 2020 at 8:09 PM Chad Dombrova  wrote:

> I would like to edit it!  I have an apache account and I am a committer,
> but IIRC I could not edit it with my normal credentials.
>
>
> On Wed, Oct 28, 2020 at 8:02 PM Robert Burke  wrote:
>
>> (it's a wiki, so anyone who requests an account can improve it)
>>
>> On Wed, Oct 28, 2020, 7:45 PM Chad Dombrova  wrote:
>>
>>> It’s unfortunate that those instructions don’t include pre-commit, which
>>> is by far the easiest way to do this.
>>>
>>> To set it up:
>>>
>>> pip install pre-commit
>>> pre-commit install
>>>
>>> Install sets up git pre-commit hooks so that it will run yapf and pylint
>>> on changed files every time you commit (you’ll need python3.7. I think it
>>> should be possible to loosen this, as this has been an annoyance for me)
>>>
>>> To skip running the check on commit add -n:
>>>
>>> git commit -nm "blah blah"
>>>
>>> Alternatively, to run the check manually on changed files (pre-commit
>>> install is not required to run it this way):
>>>
>>> pre-commit run yapf
>>>
>>> Or on all files:
>>>
>>> pre-commit run -a yapf
>>>
>>> More info here: https://pre-commit.com/#config-language_version
>>>
>>> On Wed, Oct 28, 2020 at 6:46 PM Alex Amato  wrote:
>>>
 I tried both the tox and yapf instructions on the python tips page.
 And the gradle target, which failed on PR precommit. I am wondering if there
 is something additional I need to set up?

 Here is the output from all three approaches I attempted. Any
 ideas how to get this working?

 *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ git diff
 --name-only --relative bigquery_python_sdk origin/master | xargs yapf
 --in-place*
 Traceback (most recent call last):
   File "/usr/local/google/home/ajamato/.local/bin/yapf", line 8, in
 
 sys.exit(run_main())
   File
 "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
 line 365, in run_main
 sys.exit(main(sys.argv))
   File
 "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
 line 135, in main
 verbose=args.verbose)
   File
 "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
 line 204, in FormatFiles
 in_place, print_diff, verify, quiet, verbose)
   File
 "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
 line 233, in _FormatFile
 logger=logging.warning)
   File
 "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
 line 100, in FormatFile
 verify=verify)
   File
 "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
 line 147, in FormatCode
 tree = pytree_utils.ParseCodeToTree(unformatted_source)
   File
 "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/pytree_utils.py",
 line 127, in ParseCodeToTree
 raise e
   File "apache_beam/metrics/execution.pxd", line 18
 cimport cython
  ^
 SyntaxError: invalid syntax

 *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ tox -e
 py3-yapf*
 GLOB sdist-make:
 /usr/local/google/home/ajamato/beam/sdks/python/setup.py
 py3-yapf create:
 /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf
 ERROR: invocation failed (exit code 1), logfile:
 /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/log/py3-yapf-0.log
 ===
 log start
 
 RuntimeError: failed to build image pkg_resources because:
 Traceback (most recent call last):
   File
 "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py",
 line 60, in _install
 installer.install(creator.interpreter.version_info)
   File
 "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/base.py",
 line 42, in install
 self._sync(filename, into)
   File
 "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/copy.py",
 line 13, in _sync
 copy(src, dst)
   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
 line 53, in copy
 method(norm(src), norm(dest))
   File 

Re: Unable to run python formatter (Are the instructions out of date?)

2020-10-28 Thread Chad Dombrova
I would like to edit it!  I have an apache account and I am a committer, but
IIRC I could not edit it with my normal credentials.


On Wed, Oct 28, 2020 at 8:02 PM Robert Burke  wrote:

> (it's a wiki, so anyone who requests an account can improve it)
>
> On Wed, Oct 28, 2020, 7:45 PM Chad Dombrova  wrote:
>
>> It’s unfortunate that those instructions don’t include pre-commit, which
>> is by far the easiest way to do this.
>>
>> To set it up:
>>
>> pip install pre-commit
>> pre-commit install
>>
>> Install sets up git pre-commit hooks so that it will run yapf and pylint
>> on changed files every time you commit (you’ll need python3.7. I think it
>> should be possible to loosen this, as this has been an annoyance for me)
>>
>> To skip running the check on commit add -n:
>>
>> git commit -nm "blah blah"
>>
>> Alternatively, to run the check manually on changed files (pre-commit
>> install is not required to run it this way):
>>
>> pre-commit run yapf
>>
>> Or on all files:
>>
>> pre-commit run -a yapf
>>
>> More info here: https://pre-commit.com/#config-language_version
>>
>> On Wed, Oct 28, 2020 at 6:46 PM Alex Amato  wrote:
>>
>>> I tried both the tox and yapf instructions on the python tips page.
>>> And the gradle target, which failed on PR precommit. I am wondering if there
>>> is something additional I need to set up?
>>>
>>> Here is the output from all three approaches I attempted. Any
>>> ideas how to get this working?
>>>
>>> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ git diff
>>> --name-only --relative bigquery_python_sdk origin/master | xargs yapf
>>> --in-place*
>>> Traceback (most recent call last):
>>>   File "/usr/local/google/home/ajamato/.local/bin/yapf", line 8, in
>>> 
>>> sys.exit(run_main())
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>>> line 365, in run_main
>>> sys.exit(main(sys.argv))
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>>> line 135, in main
>>> verbose=args.verbose)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>>> line 204, in FormatFiles
>>> in_place, print_diff, verify, quiet, verbose)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>>> line 233, in _FormatFile
>>> logger=logging.warning)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
>>> line 100, in FormatFile
>>> verify=verify)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
>>> line 147, in FormatCode
>>> tree = pytree_utils.ParseCodeToTree(unformatted_source)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/pytree_utils.py",
>>> line 127, in ParseCodeToTree
>>> raise e
>>>   File "apache_beam/metrics/execution.pxd", line 18
>>> cimport cython
>>>  ^
>>> SyntaxError: invalid syntax
>>>
>>> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ tox -e
>>> py3-yapf*
>>> GLOB sdist-make: /usr/local/google/home/ajamato/beam/sdks/python/setup.py
>>> py3-yapf create:
>>> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf
>>> ERROR: invocation failed (exit code 1), logfile:
>>> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/log/py3-yapf-0.log
>>> ===
>>> log start
>>> 
>>> RuntimeError: failed to build image pkg_resources because:
>>> Traceback (most recent call last):
>>>   File
>>> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py",
>>> line 60, in _install
>>> installer.install(creator.interpreter.version_info)
>>>   File
>>> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/base.py",
>>> line 42, in install
>>> self._sync(filename, into)
>>>   File
>>> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/copy.py",
>>> line 13, in _sync
>>> copy(src, dst)
>>>   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
>>> line 53, in copy
>>> method(norm(src), norm(dest))
>>>   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
>>> line 64, in copytree
>>> shutil.copy(src_f, dest_f)
>>>   File "/usr/lib/python3.8/shutil.py", line 415, in copy
>>> copyfile(src, dst, follow_symlinks=follow_symlinks)
>>>   File "/usr/lib/python3.8/shutil.py", line 261, in copyfile
>>> with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
>>> FileNotFoundError: [Errno 2] No such file or directory:
>>> 

Re: Unable to run python formatter (Are the instructions out of date?)

2020-10-28 Thread Robert Burke
(it's a wiki, so anyone who requests an account can improve it)

On Wed, Oct 28, 2020, 7:45 PM Chad Dombrova  wrote:

> It’s unfortunate that those instructions don’t include pre-commit, which
> is by far the easiest way to do this.
>
> To set it up:
>
> pip install pre-commit
> pre-commit install
>
> Install sets up git pre-commit hooks so that it will run yapf and pylint
> on changed files every time you commit (you’ll need python3.7. I think it
> should be possible to loosen this, as this has been an annoyance for me)
>
> To skip running the check on commit add -n:
>
> git commit -nm "blah blah"
>
> Alternatively, to run the check manually on changed files (pre-commit
> install is not required to run it this way):
>
> pre-commit run yapf
>
> Or on all files:
>
> pre-commit run -a yapf
>
> More info here: https://pre-commit.com/#config-language_version
>
> On Wed, Oct 28, 2020 at 6:46 PM Alex Amato  wrote:
>
>> I tried both the tox and yapf instructions on the python tips page.
>> And the gradle target, which failed on PR precommit. I am wondering if there
>> is something additional I need to set up?
>>
>> Here is the output from all three approaches I attempted. Any
>> ideas how to get this working?
>>
>> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ git diff
>> --name-only --relative bigquery_python_sdk origin/master | xargs yapf
>> --in-place*
>> Traceback (most recent call last):
>>   File "/usr/local/google/home/ajamato/.local/bin/yapf", line 8, in
>> 
>> sys.exit(run_main())
>>   File
>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>> line 365, in run_main
>> sys.exit(main(sys.argv))
>>   File
>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>> line 135, in main
>> verbose=args.verbose)
>>   File
>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>> line 204, in FormatFiles
>> in_place, print_diff, verify, quiet, verbose)
>>   File
>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>> line 233, in _FormatFile
>> logger=logging.warning)
>>   File
>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
>> line 100, in FormatFile
>> verify=verify)
>>   File
>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
>> line 147, in FormatCode
>> tree = pytree_utils.ParseCodeToTree(unformatted_source)
>>   File
>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/pytree_utils.py",
>> line 127, in ParseCodeToTree
>> raise e
>>   File "apache_beam/metrics/execution.pxd", line 18
>> cimport cython
>>  ^
>> SyntaxError: invalid syntax
>>
>> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ tox -e
>> py3-yapf*
>> GLOB sdist-make: /usr/local/google/home/ajamato/beam/sdks/python/setup.py
>> py3-yapf create:
>> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf
>> ERROR: invocation failed (exit code 1), logfile:
>> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/log/py3-yapf-0.log
>> ===
>> log start
>> 
>> RuntimeError: failed to build image pkg_resources because:
>> Traceback (most recent call last):
>>   File
>> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py",
>> line 60, in _install
>> installer.install(creator.interpreter.version_info)
>>   File
>> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/base.py",
>> line 42, in install
>> self._sync(filename, into)
>>   File
>> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/copy.py",
>> line 13, in _sync
>> copy(src, dst)
>>   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
>> line 53, in copy
>> method(norm(src), norm(dest))
>>   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
>> line 64, in copytree
>> shutil.copy(src_f, dest_f)
>>   File "/usr/lib/python3.8/shutil.py", line 415, in copy
>> copyfile(src, dst, follow_symlinks=follow_symlinks)
>>   File "/usr/lib/python3.8/shutil.py", line 261, in copyfile
>> with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
>> FileNotFoundError: [Errno 2] No such file or directory:
>> '/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/lib/python3.8/site-packages/pkg_resources/_vendor/packaging/__init__.py'
>>
>>
>> 
>> log end
>> 

Re: Unable to run python formatter (Are the instructions out of date?)

2020-10-28 Thread Chad Dombrova
It’s unfortunate that those instructions don’t include pre-commit, which is
by far the easiest way to do this.

To set it up:

pip install pre-commit
pre-commit install

Install sets up git pre-commit hooks so that it will run yapf and pylint on
changed files every time you commit (you’ll need python3.7. I think it
should be possible to loosen this, as this has been an annoyance for me)

To skip running the check on commit add -n:

git commit -nm "blah blah"

Alternatively, to run the check manually on changed files (pre-commit
install is not required to run it this way):

pre-commit run yapf

Or on all files:

pre-commit run -a yapf

More info here: https://pre-commit.com/#config-language_version
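
For anyone who wants to loosen the python3.7 pin mentioned above: pre-commit
reads the pinned interpreter from .pre-commit-config.yaml, either per hook
(language_version) or globally. Roughly like this; the keys are illustrative,
so double-check against the repo's actual config:

default_language_version:
    python: python3.7  # relaxing or removing this pin should lift the 3.7 requirement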

On Wed, Oct 28, 2020 at 6:46 PM Alex Amato  wrote:

> I tried both the tox and yapf instructions on the python tips page.
> And the gradle target, which failed on PR precommit. I am wondering if there
> is something additional I need to set up?
>
> Here is the output from all three approaches I attempted. Any
> ideas how to get this working?
>
> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ git diff
> --name-only --relative bigquery_python_sdk origin/master | xargs yapf
> --in-place*
> Traceback (most recent call last):
>   File "/usr/local/google/home/ajamato/.local/bin/yapf", line 8, in
> 
> sys.exit(run_main())
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 365, in run_main
> sys.exit(main(sys.argv))
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 135, in main
> verbose=args.verbose)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 204, in FormatFiles
> in_place, print_diff, verify, quiet, verbose)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 233, in _FormatFile
> logger=logging.warning)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
> line 100, in FormatFile
> verify=verify)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
> line 147, in FormatCode
> tree = pytree_utils.ParseCodeToTree(unformatted_source)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/pytree_utils.py",
> line 127, in ParseCodeToTree
> raise e
>   File "apache_beam/metrics/execution.pxd", line 18
> cimport cython
>  ^
> SyntaxError: invalid syntax
>
> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ tox -e py3-yapf*
> GLOB sdist-make: /usr/local/google/home/ajamato/beam/sdks/python/setup.py
> py3-yapf create:
> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf
> ERROR: invocation failed (exit code 1), logfile:
> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/log/py3-yapf-0.log
> ===
> log start
> 
> RuntimeError: failed to build image pkg_resources because:
> Traceback (most recent call last):
>   File
> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py",
> line 60, in _install
> installer.install(creator.interpreter.version_info)
>   File
> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/base.py",
> line 42, in install
> self._sync(filename, into)
>   File
> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/copy.py",
> line 13, in _sync
> copy(src, dst)
>   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
> line 53, in copy
> method(norm(src), norm(dest))
>   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
> line 64, in copytree
> shutil.copy(src_f, dest_f)
>   File "/usr/lib/python3.8/shutil.py", line 415, in copy
> copyfile(src, dst, follow_symlinks=follow_symlinks)
>   File "/usr/lib/python3.8/shutil.py", line 261, in copyfile
> with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
> FileNotFoundError: [Errno 2] No such file or directory:
> '/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/lib/python3.8/site-packages/pkg_resources/_vendor/packaging/__init__.py'
>
>
> 
> log end
> =
> ERROR: InvocationError for command /usr/bin/python3 -m virtualenv
> --no-download --python /usr/bin/python3 py3-yapf (exited with code 1)
> 

Unable to run python formatter (Are the instructions out of date?)

2020-10-28 Thread Alex Amato
I tried both the tox and yapf instructions on the python tips page.
And the gradle target, which failed on PR precommit. I am wondering if there
is something additional I need to set up?

Here is the output from all three approaches I attempted. Any
ideas how to get this working?

*(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ git diff
--name-only --relative bigquery_python_sdk origin/master | xargs yapf
--in-place*
Traceback (most recent call last):
  File "/usr/local/google/home/ajamato/.local/bin/yapf", line 8, in 
sys.exit(run_main())
  File
"/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
line 365, in run_main
sys.exit(main(sys.argv))
  File
"/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
line 135, in main
verbose=args.verbose)
  File
"/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
line 204, in FormatFiles
in_place, print_diff, verify, quiet, verbose)
  File
"/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
line 233, in _FormatFile
logger=logging.warning)
  File
"/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
line 100, in FormatFile
verify=verify)
  File
"/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
line 147, in FormatCode
tree = pytree_utils.ParseCodeToTree(unformatted_source)
  File
"/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/pytree_utils.py",
line 127, in ParseCodeToTree
raise e
  File "apache_beam/metrics/execution.pxd", line 18
cimport cython
 ^
SyntaxError: invalid syntax
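
My guess at the culprit here: yapf is resolving from a python2.7
site-packages directory, and it is also being handed a Cython .pxd file that
no plain-Python parser will accept. Something like the following might work
around both, though I have not verified it:

python3 -m pip install --user yapf
git diff --name-only --relative bigquery_python_sdk origin/master | grep '\.py$' | xargs python3 -m yapf --in-place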

*(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ tox -e py3-yapf*
GLOB sdist-make: /usr/local/google/home/ajamato/beam/sdks/python/setup.py
py3-yapf create:
/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf
ERROR: invocation failed (exit code 1), logfile:
/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/log/py3-yapf-0.log
===
log start

RuntimeError: failed to build image pkg_resources because:
Traceback (most recent call last):
  File
"/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py",
line 60, in _install
installer.install(creator.interpreter.version_info)
  File
"/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/base.py",
line 42, in install
self._sync(filename, into)
  File
"/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/copy.py",
line 13, in _sync
copy(src, dst)
  File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py", line
53, in copy
method(norm(src), norm(dest))
  File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py", line
64, in copytree
shutil.copy(src_f, dest_f)
  File "/usr/lib/python3.8/shutil.py", line 415, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.8/shutil.py", line 261, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory:
'/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/lib/python3.8/site-packages/pkg_resources/_vendor/packaging/__init__.py'



log end
=
ERROR: InvocationError for command /usr/bin/python3 -m virtualenv
--no-download --python /usr/bin/python3 py3-yapf (exited with code 1)

summary
_
ERROR:   py3-yapf: InvocationError for command /usr/bin/python3 -m
virtualenv --no-download --python /usr/bin/python3 py3-yapf (exited with
code 1)
(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$
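
The tox failure, by contrast, looks like a stale or half-built virtualenv
rather than a code problem. A common remedy, not verified against this exact
setup, is to discard and recreate the environment:

rm -rf target/.tox/py3-yapf
tox -e py3-yapf --recreate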



*ajamato@ajamato-linux0:~/beam$ ./gradlew
:sdks:python:test-suites:tox:py38:formatter*
To honour the JVM settings for this build a new JVM will be forked. Please
consider using the daemon:
https://docs.gradle.org/6.6.1/userguide/gradle_daemon.html.
Daemon will be stopped at the end of the build stopping after processing
Configuration on demand is an incubating feature.

> Task :sdks:python:test-suites:tox:py38:formatter
GLOB sdist-make:
/usr/local/google/home/ajamato/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/setup.py
py3-yapf-check recreate:

Re: Proposal: ToStringFn

2020-10-28 Thread Sam Rohde
done!

On Wed, Oct 28, 2020 at 3:54 PM Tyson Hamilton  wrote:

> Can you open up comment access please?
>
> On Wed, Oct 28, 2020 at 3:40 PM Sam Rohde  wrote:
>
>> +Lukasz Cwik 
>>
>> On Tue, Oct 27, 2020 at 12:04 PM Sam Rohde  wrote:
>>
>>> Hi All,
>>>
>>> I'm working on a project in Dataflow that requires the runner to
>>> translate an element to a human-readable form. To do this, I want to add a
>>> new well-known transform that allows any runner to ask the SDK to stringify
>>> (human-readable) an element. Let me know what you think, you can find
>>> the proposed specification and implementation details here
>>> 
>>> .
>>>
>>> If there are no objections, I want to start implementation as soon as I
>>> can.
>>>
>>> Regards,
>>> Sam
>>>
>>


Re: Proposal: ToStringFn

2020-10-28 Thread Tyson Hamilton
Can you open up comment access please?

On Wed, Oct 28, 2020 at 3:40 PM Sam Rohde  wrote:

> +Lukasz Cwik 
>
> On Tue, Oct 27, 2020 at 12:04 PM Sam Rohde  wrote:
>
>> Hi All,
>>
>> I'm working on a project in Dataflow that requires the runner to
>> translate an element to a human-readable form. To do this, I want to add a
>> new well-known transform that allows any runner to ask the SDK to stringify
>> (human-readable) an element. Let me know what you think, you can find
>> the proposed specification and implementation details here
>> 
>> .
>>
>> If there are no objections, I want to start implementation as soon as I
>> can.
>>
>> Regards,
>> Sam
>>
>


Re: Jenkins Java Version Confusion/Failures

2020-10-28 Thread Tyson Hamilton
Thanks Udi & Infra team for helping me out here. This was discussed on the 
builds mailing list [1] and in the INFRA ticket [2]. Testing a PR that should 
fix it:

https://github.com/apache/beam/pull/13213

[1]: 
https://lists.apache.org/thread.html/rb4c2834b9874b9f4a74c528de9055958483d2bc6e62c3464bc5c053f%40%3Cbuilds.apache.org%3E
[2]: https://issues.apache.org/jira/browse/INFRA-20858
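
For anyone curious why the snapshot error below implicates JDK 11:
ByteBuffer.flip() gained a covariant ByteBuffer return type in Java 9, so
bytecode compiled on JDK 9+ references ByteBuffer.flip()Ljava/nio/ByteBuffer;,
which does not exist on a Java 8 runtime. The usual source-level guard, shown
purely as an illustration, is to bind the call to the Buffer supertype (or to
compile with javac --release 8):

import java.nio.Buffer;
import java.nio.ByteBuffer;

class FlipCompat {
  static void fillAndFlip(ByteBuffer buf) {
    buf.put((byte) 1);
    // Casting to Buffer binds to Buffer.flip(), present on JDK 8 and 9+ alike.
    ((Buffer) buf).flip();
  }
}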


On 2020/10/28 18:24:48, Tyson Hamilton  wrote: 
> Hi All,
> 
> This is an early notice that I'm seeing an issue in Jenkins related to JDK 
> versions. I've created an INFRA issue [1] to get support from Apache, see the 
> details there if you're interested. What it looks like is the nightly artifacts 
> are being built with JDK11 [2] and the issue started on October 18th. I don't 
> see any related changes in our configs, so I'm thinking it may be infra 
> related. If you know more please speak up!
> 
> Snapshot error looks JDK11 related:
> 
> An exception occured while executing the Java class. 
> java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
> 
> There is also this in various java jobs (including inventory jobs):
> 
> No JDK named ‘JDK 1.8 (latest)’ found
> 
> Follow the INFRA ticket for updates.
> 
> -Tyson
> 
> [1]: https://issues.apache.org/jira/browse/INFRA-21043
> [2]: 
> https://ci-beam.apache.org/job/beam_PostRelease_NightlySnapshot/1144/consoleFull
> 


Re: Proposal: ToStringFn

2020-10-28 Thread Sam Rohde
+Lukasz Cwik 

On Tue, Oct 27, 2020 at 12:04 PM Sam Rohde  wrote:

> Hi All,
>
> I'm working on a project in Dataflow that requires the runner to translate
> an element to a human-readable form. To do this, I want to add a new
> well-known transform that allows any runner to ask the SDK to stringify
> (human-readable) an element. Let me know what you think, you can find
> the proposed specification and implementation details here
> 
> .
>
> If there are no objections, I want to start implementation as soon as I
> can.
>
> Regards,
> Sam
>
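
As a concrete illustration of the idea (not the proposed spec, which lives in
the linked doc), such a transform could be as small as a DoFn that defers to
the SDK language's native string rendering:

import org.apache.beam.sdk.transforms.DoFn;

class ToStringFn<T> extends DoFn<T, String> {
  @ProcessElement
  public void process(@Element T element, OutputReceiver<String> out) {
    // Delegate to the language's human-readable rendering of the element.
    out.output(String.valueOf(element));
  }
}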


Re: Guarding Python setup function in __main__ checker

2020-10-28 Thread Alex Amato
Found this stackoverflow question.
https://stackoverflow.com/questions/59831397/run-setup-function-from-setuptools-only-if-name-main

From what I understand, normally it's not needed since setup.py is generally
not imported. But yes, a main guard would prevent code from running if the
setup.py is imported.
I think it would be best to first understand why setup.py is being imported in
the first place, which is a bit odd.

Does one of our dependencies or tools do this? Is that tool making an
assumption that we have main-guarded our setup.py?
It seems like run_integration_test.sh
runs setup.py itself, via nose. Is this an intended behaviour of nose? Is
it doing some setup first and then running setup.py?
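
For concreteness, the guard being discussed is the standard one; a minimal
sketch with illustrative fields, not our real setup.py:

import setuptools

def main():
    # Runs only when setup.py is executed directly, not when it is imported.
    setuptools.setup(name="example", version="0.0.1")

if __name__ == "__main__":
    main()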



On Wed, Oct 28, 2020 at 1:38 PM Heejong Lee  wrote:

> I've encountered the following error while I was testing Python
> integration tests via `run_integration_test.sh` on MacOS:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/multiprocessing/spawn.py",
> line 116, in spawn_main
> exitcode = _main(fd, parent_sentinel)
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/multiprocessing/spawn.py",
> line 125, in _main
> prepare(preparation_data)
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/multiprocessing/spawn.py",
> line 236, in prepare
> _fixup_main_from_path(data['init_main_from_path'])
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/multiprocessing/spawn.py",
> line 287, in _fixup_main_from_path
> main_content = runpy.run_path(main_path,
>   File "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/runpy.py", line
> 265, in run_path
> return _run_module_code(code, init_globals, run_name,
>   File "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/runpy.py", line
> 97, in _run_module_code
> _run_code(code, mod_globals, init_globals,
>   File "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/runpy.py", line
> 87, in _run_code
> exec(code, run_globals)
>   File "/Users/heejong/Work/beam/sdks/python/setup.py", line 259, in
> 
> setuptools.setup(
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/setuptools/__init__.py",
> line 153, in setup
> return distutils.core.setup(**attrs)
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/distutils/core.py",
> line 148, in setup
> dist.run_commands()
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/distutils/dist.py",
> line 966, in run_commands
> self.run_command(cmd)
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/distutils/dist.py",
> line 985, in run_command
> cmd_obj.run()
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/nose/commands.py",
> line 158, in run
> TestProgram(argv=argv, config=self.__config)
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/nose/core.py",
> line 118, in __init__
> unittest.TestProgram.__init__(
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/unittest/main.py", line
> 100, in __init__
> self.parseArgs(argv)
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/nose/core.py",
> line 145, in parseArgs
> self.config.configure(argv, doc=self.usage())
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/nose/config.py",
> line 346, in configure
> self.plugins.configure(options, self)
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/nose/plugins/manager.py",
> line 284, in configure
> cfg(options, config)
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/nose/plugins/manager.py",
> line 99, in __call__
> return self.call(*arg, **kw)
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/nose/plugins/manager.py",
> line 167, in simple
> result = meth(*arg, **kw)
>   File
> "/Users/heejong/Work/beam/build/gradleenv/192237/lib/python3.8/site-packages/nose_xunitmp.py",
> line 42, in configure
> manager = multiprocessing.Manager()
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/multiprocessing/context.py",
> line 57, in Manager
> m.start()
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/multiprocessing/managers.py",
> line 579, in start
> self._process.start()
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/multiprocessing/process.py",
> line 121, in start
> self._popen = self._Popen(self)
>   File
> "/Users/heejong/.pyenv/versions/3.8.6/lib/python3.8/multiprocessing/context.py",
> line 284, in _Popen
> return Popen(process_obj)
>   File
> 

Re: Problem with updating Hadoop and Hcatalog dependencies

2020-10-28 Thread Tyson Hamilton



On 2020/10/22 07:35:47, Piotr Szuberski  wrote: 
> I'm trying to update Hadoop dependencies to the recent 3.3.0 and I've 
> encountered a problem - the Hadoop related checks seem to work without any 
> further changes, but HCatalog needs to be bumped to 3.x.y versions as well 
> (2.x.y versions require Hadoop 2.x.y).

Are there any backwards incompatible changes with the Hadoop major version 
bump? If so it may be better to push this off until the next major Beam release 
and just do a minor version bump for Hadoop.

> 
> When I use Hadoop 3.3.0 there is a guava jar versions related exception [1] 
> which I tried to solve by enforcing Guava 27.0-jre which is used by Hadoop 
> 3.3.0 - without success.
> 
> Then I used Hadoop 3.2.0 which doesn't have guava updated and Hive 3.1.2. I 
> also replaced hive-site.xml with the recent default one from Hive's master. 
> Then 4 tests from io/hcatalog are failing: 
> testWriteThenReadSuccess - with exception [2]
> testWriteThenUnboundedReadSuccess - with the same exception.
> 
> As far as I can deduce, it's a bit misleading, because setOutput is indeed called 
> in HCatalogIO.Write's writerContext = masterWriter.prepareWrite(), which 
> under the hood tries to call setOutput and fails.
> 
> The probable cause is the HCatalog configuration, but I definitely lack 
> the knowledge of how to set it up, and HCatalog's version 3.x 
> documentation really doesn't help.
> 
> Do we have anyone with some knowledge about HCatalog that could help me with 
> this?

timrobertson100@ is listed as the owner of this dependency [1].

[1] 
https://github.com/apache/beam/blob/master/ownership/JAVA_DEPENDENCY_OWNERS.yaml

> 
> 
> [1] NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;J)V
> [2] org.apache.beam.sdk.util.UserCodeException: 
> org.apache.hive.hcatalog.common.HCatException : 2004 : HCatOutputFormat not 
> initialized, setOutput has to be called. Cause : 
> org.apache.hive.hcatalog.common.HCatException : 2001 : Error setting output 
> information. Cause : java.lang.NullPointerException
> 
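
On the Guava clash in [1]: the usual Gradle lever for pinning a single Guava
version across a configuration is a resolutionStrategy force. This is only a
sketch; no claim about how Beam's vendoring and shading interact with it:

configurations.all {
  resolutionStrategy {
    force 'com.google.guava:guava:27.0-jre'
  }
}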


Re: [Call for items] Beam October 2020 Newsletter

2020-10-28 Thread Brittany Hermann
Hi Daniel,

Thanks for reaching out! Our intended audience is both devs and users. The
purpose of this newsletter is to give folks visibility into the health and
growth of the community. I am aware that the previous Beam Community
newsletter showcased updates about the project. Other contributors are
actually working on bringing that part back and having that newsletter focus
specifically on project updates.

Hopefully this clarifies everything that you are looking for! Please let me
know if you have any other questions.

On Mon, Oct 26, 2020 at 5:48 PM Daniel Oliveira 
wrote:

> Hi Brittany,
>
> I gave the newsletter a look, but I still have two questions about it:
>
> 1. Who's the intended audience for this newsletter, devs or users? Because
> I imagine there would probably be a difference in content between the two.
> Dev news might include infrastructure changes that are irrelevant to users
> (like news about our testing, or our repo, etc.), as well as mentioning
> changes that might be in the master branch but have not yet been released.
> And vice versa, if it's aimed at users it will probably need to be
> constrained to features that have been released, so users don't expect
> features they can't actually use yet.
>
> 2. Along similar lines, are SDK feature announcements appropriate for this
> newsletter? (For example, announcing that SplittableDoFn is available for
> implementing sources in the Go SDK.) If the newsletter is aimed at users, I
> imagine that new features are probably more appropriate in the Beam release
> changelog (with maybe some highlights called out here). On the other hand,
> if it's dev focused, then this seems like a good platform to share progress
> that we as a community have made on various different projects, I'm just
> not sure what section would be most appropriate for it.
>
> Aside from those two questions, this newsletter looks really cool! I
> especially like the social media and online engagement sections. For
> someone like me who rarely pays attention to social media, it's nice to see
> a summary like that.
>
> Thanks,
> Daniel Oliveira
>
> On Mon, Oct 26, 2020 at 11:06 AM Brittany Hermann 
> wrote:
>
>> Hi everyone,
>>
>> I am thinking of creating a monthly Beam Community newsletter like this:
>> https://docs.google.com/document/d/1_t6xKoOQVwgn2edmRVh1ViudmbnNM3BwZyguKAwwjfA/edit?usp=sharing
>>
>> My intention would be to send this out once a month. Could you please add
>> in any updates that you would like to share with the community by 10/28 at
>> 11:59 pm PST? I am planning to edit and send the final version through the
>> mailing list on 10/30.
>>
>> Thank you!
>> -Brittany Hermann
>>
>>
>>

-- 

Brittany Hermann

Open Source Program Manager (Provided by Adecco Staffing)

1190 Bordeaux Drive, Building 4, Sunnyvale, CA 94089



Re: Possible 80% reduction in overhead for flink runner, input needed

2020-10-28 Thread Maximilian Michels
You are right that Flink serializers do not bother to copy immutable 
Java types, e.g. Long, Integer, String. However, Pojos or other custom 
types can be mutated, and Flink does a deep copy in this case.


If you look at the PojoSerializer in Flink, you will see that it does a 
deep copy: 
https://github.com/apache/flink/blob/d13f66be552eac89f45469c199ae036087baa38d/flink-core/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L228


Also Flink uses Java serialization if the generic Kryo serializer fails: 
https://github.com/apache/flink/blob/d13f66be552eac89f45469c199ae036087baa38d/flink-core/src/main/java/org/apache/flink/api/java/typeutils/runtime/kryo/KryoSerializer.java#L251


In Beam we are just wrapping Beam coders, so we do not know whether a 
type is mutable or not. This is why we always deep-copy. I'm not sure 
that changing this to always return the input would be safe. However, we 
could add some exceptions for Beam types we are sure cannot be mutated. 
Also, a pipeline option is ok if it is opt-in.


-Max

On 28.10.20 11:59, Teodor Spæren wrote:

Hey Max!

Just to make sure we are on the same page:

When you say "Flinks default behavior" do you mean Apache Flink the 
project or Beams Flink Runner? I'm assuming the Flink Runner, since 
Apache Flink leaves it up to the TypeSerializer to decide how to copy 
and none of the ones I've seen so far choose to do it through a 
serialization then deserialization method.


Is the bug hard to detect? Using the direct runner will warn of this 
misuse by default, without any need to change the pipeline itself, as 
far as I know? Please correct me if I'm wrong, I don't have much 
experience with Beam!


PCollections being immutable does mean that the input element should not 
be modified, rather if a modification is needed it is up to the user to 
copy the input before changing it [1](3.2.3 Immutability). I think this 
is what you are saying, but I just wanted to make sure :)


Also, I think naming the flag anything with object reuse would be 
confusing to users, as flink already has the concept of object reuse 
[2](enableObjectReuse), and there is an option on the runner mentioning 
this already[3](objectReuse). I'm thinking something more along the 
lines of "fasterCopy" or "disableFailsafeCopying".


Best regards,
Teodor Spæren

[1]: 
https://beam.apache.org/documentation/programming-guide/#pcollection-characteristics 

[2]: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/execution_configuration.html 

[3]: 
https://beam.apache.org/documentation/runners/flink/#pipeline-options-for-the-flink-runner 



On Wed, Oct 28, 2020 at 10:31:51AM +0100, Maximilian Michels wrote:

Very good observation @Teodor!

Flink's default behavior is to copy elements by going through a 
serialization/deserialization roundtrip. This occurs regardless of 
whether operators are "chained" (directly pass on data without going 
through the network) or not.


This default was chosen for a reason because users tend to forget that 
you must not modify the input. It can cause very hard to detect "bugs".


PCollections are immutable but that does not mean transforms are not 
allowed to mutate inputs. Rather it means that the original element 
will not change but a copy of that element.


+1 for a flag to enable object reuse
-1 for changing the default for the above reason

Cheers,
Max

On 27.10.20 21:54, Kenneth Knowles wrote:

It seems that many correct things are said on this thread.

1. Elements of a PCollection are immutable. They should be like 
mathematical values.
2. For performance reasons, the author of a DoFn is responsible to 
not mutate input elements and also to not mutate outputs once they 
have been output.

3. The direct runner does extra work to check if your DoFn* is wrong.
4. On a production runner it is expected that serialization only 
occurs when needed for shipping data**


If the FlinkRunner is serializing things that don't have to be 
shipped that seems like a great easy win.


Kenn

*notably CombineFn has an API that is broken; only the first 
accumulator is allowed to be mutated and a runner is responsible for 
cloning it as necessary; it is expected that combining many elements 
will execute by mutating one unaliased accumulator many times
**small caveat that when doing in-memory groupings you need to use 
Coder#structuralValue and group by that, which may serialize but 
hopefully does something smarter


On Tue, Oct 27, 2020 at 8:52 AM Reuven Lax  wrote:


   Actually I believe that the Beam model does say that input elements
   should be immutable. If I remember correctly, the DirectRunner even
   validates this in unit tests, failing tests if the input elements
   have been mutated.

    On Tue, Oct 27, 2020 at 3:49 AM David Morávek  wrote:

   Hi Teodor,

   Thanks for bringing this up. This is a known, long standing
   "issue". Unfortunately there are 

Re: Possible 80% reduction in overhead for flink runner, input needed

2020-10-28 Thread Teodor Spæren
Ok, I just ran all the tests with my change and all of them pass with a 
simple "return v;" instead of the serialize and deserialize. This is 
encouraging!
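
Concretely, the change I tested amounts to the following (the flag name is
hypothetical and the wiring is exactly the open question below):

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.util.CoderUtils;

// Simplified sketch of the copy path in the Flink runner's CoderTypeSerializer.
class FastCopySketch<T> {
  private final Coder<T> coder;
  private final boolean fasterCopy; // hypothetical opt-in flag

  FastCopySketch(Coder<T> coder, boolean fasterCopy) {
    this.coder = coder;
    this.fasterCopy = fasterCopy;
  }

  T copy(T value) throws Exception {
    if (fasterCopy) {
      // Beam treats PCollection elements as immutable, so returning the same
      // reference is safe as long as user code obeys the model.
      return value;
    }
    // Current behaviour: deep copy via an encode/decode roundtrip.
    return CoderUtils.clone(coder, value);
  }
}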


A question about implementation. The idea is to set this via a pipeline 
option, but a problem is that most places which use CoderTypeSerializer 
don't pass down the pipeline options. I can go through each and try to 
pass it down, but is there any easier way? Some global variable or an 
earlier point where we could do this? Or simply just remove the 
constructor without pipeline options?


Best Regards,
Teodor Spæren

On Tue, Oct 27, 2020 at 02:35:10PM +0100, Teodor Spæren wrote:
@David, I don't know how the direct runner does the validation, so I'm 
not sure if we could replicate that to the flink runner without a perf 
penalty. Your point about writing tests I actually think is an 
argument for removing this as soon as possible, so the prototype 
doesn't blow up in production :P But I think a flag would be best.


@Jan, This will actually be part of my master thesis and so it would 
be my pleasure to perform more benchmarks and share the results. I 
have not had the time yet to test it out, but I agree with you about 
the performance impact being less for more realistic jobs. The 
benchmark I have used here is a worst case scenario. Would continuing 
with the synthetic sources, but changing the processing steps, work? So far 
all my tests have been with parallelism set to 1 and this is something 
I'm also going to explore more in the thesis.


But am I correct in thinking that it's worth creating a Jira issue 
over this and assigning it to myself? As I said in the original email, 
I think the change is simple enough so that I can implement it and I 
would like to try contributing to Beam.


(Not going to lie, it would also look good on my master thesis ;) )

Best regards,
Teodor Spæren


On Tue, Oct 27, 2020 at 01:53:11PM +0100, Jan Lukavský wrote:

Hi,

I tend to be +1 for the flag, but before that, we might want to have 
a deeper analysis of the performance impact. I believe the penalty 
will be (in percentage) much lower in cases of more practical jobs 
(e.g. having at least one shuffle).


@Teodor, would you be willing to provide us with some measurements 
of jobs doing something more practical, than simple stateless 
mappings? E.g. a few jobs doing 1, 2 and 3 shuffle phases to see 
what is the impact of these more complex scenarios on the 
performance penalty?


Cheers,

 Jan

On 10/27/20 1:24 PM, David Morávek wrote:
you made a really good argument ;) I'm inclined to an experimental 
opt-in flag that would enable this. It would be great if we could 
automatically check for violations - kind of a safety net, for 
mistakes in user code.


Just to note, direct runner enforcement may not cover all cases, 
as it only checks binary representation after serialization. Also 
there are programmers that don't write tests, especially during 
prototyping (not an argument for perf. penalty, but something to 
keep in mind).


Max, WDYT?





On Tue, Oct 27, 2020 at 12:44 PM Teodor Spæren 
wrote:


  Some more thoughts:

  As it says on the DirectRunner [1] page, the DirectRunner is meant to
  check that users don't rely on semantics that are not guaranteed
  by the
  Beam model.

  Programs that rely on the Flink runner deep cloning the inputs
  between
  each operator in the pipeline is relying on a semantic that is not
  guaranteed by the Beam model, and those pipelines would fail if
  ran on
  the DirectRunner.

  As I stated in the previous email, I have some example programs that
  return different outputs on the Flink runner and on the
  DirectRunner. I
  have not tested these programs on other runners, so I don't know what
  they would return. If they return different answers than the
  DirectRunner, I'm inclined to say that the DirectRunner should
  either be
  changed, or the runners be changed.

   From my very limited point of view, the Flink runner seems to be
  spending a lot of extra time implementing a semantic guarantee
  that the
  Beam model explicitly doesn't support.


  Best regards,
  Teodor Spæren

  [1]: https://beam.apache.org/documentation/runners/direct/

  On Tue, Oct 27, 2020 at 12:08:51PM +0100, Teodor Spæren wrote:
  >Hey David,
  >
  >I think I might have worded this poorly, because what I meant is
  that
  >from what I can see in [1], the BEAM model explicitly states that
  >PCollections should be treated as immutable. The direct runner also
  >tests for this. Do the other runners also protect the user from
  >misusing the system so? If not we have a situation where running the
  >same pipeline on two different runners will yield different
  answers. I
   >can show some examples that return different answers for the Flink
  >and the Direct Runner.
  >
   >I agree that breaking existing pipelines is a no-no, but I do
  think
  >that we could simply gate this behind an option on the Flink 

Re: Possible 80% reduction in overhead for flink runner, input needed

2020-10-28 Thread Teodor Spæren

Hey Max!

Just to make sure we are on the same page:

When you say "Flinks default behavior" do you mean Apache Flink the 
project or Beams Flink Runner? I'm assuming the Flink Runner, since 
Apache Flink leaves it up to the TypeSerializer to decide how to copy 
and none of the ones I've seen so far choose to do it through a 
serialization then deserialization method.


Is the bug hard to detect? Using the direct runner will warn of this 
misuse by default, without any need to change the pipeline itself, as 
far as I know? Please correct me if I'm wrong, I don't have much 
experience with Beam!


PCollections being immutable does mean that the input element should not 
be modified, rather if a modification is needed it is up to the user to 
copy the input before changing it [1](3.2.3 Immutability). I think 
this is what you are saying, but I just wanted to make sure :)


Also, I think naming the flag anything with object reuse would be 
confusing to users, as flink already has the concept of object reuse 
[2](enableObjectReuse), and there is an option on the runner mentioning 
this already[3](objectReuse). I'm thinking something more along the 
lines of "fasterCopy" or "disableFailsafeCopying".


Best regards,
Teodor Spæren

[1]: 
https://beam.apache.org/documentation/programming-guide/#pcollection-characteristics
[2]: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/execution_configuration.html
[3]: 
https://beam.apache.org/documentation/runners/flink/#pipeline-options-for-the-flink-runner

On Wed, Oct 28, 2020 at 10:31:51AM +0100, Maximilian Michels wrote:

Very good observation @Teodor!

Flink's default behavior is to copy elements by going through a 
serialization/deserialization roundtrip. This occurs regardless of 
whether operators are "chained" (directly pass on data without going 
through the network) or not.


This default was chosen for a reason because users tend to forget that 
you must not modify the input. It can cause very hard to detect 
"bugs".


PCollections are immutable but that does not mean transforms are not 
allowed to mutate inputs. Rather it means that the original element 
will not change but a copy of that element.


+1 for a flag to enable object reuse
-1 for changing the default for the above reason

Cheers,
Max

On 27.10.20 21:54, Kenneth Knowles wrote:

It seems that many correct things are said on this thread.

1. Elements of a PCollection are immutable. They should be like 
mathematical values.
2. For performance reasons, the author of a DoFn is responsible to 
not mutate input elements and also to not mutate outputs once they 
have been output.

3. The direct runner does extra work to check if your DoFn* is wrong.
4. On a production runner it is expected that serialization only 
occurs when needed for shipping data**


If the FlinkRunner is serializing things that don't have to be 
shipped that seems like a great easy win.


Kenn

*notably CombineFn has an API that is broken; only the first 
accumulator is allowed to be mutated and a runner is responsible for 
cloning it as necessary; it is expected that combining many elements 
will execute by mutating one unaliased accumulator many times
**small caveat that when doing in-memory groupings you need to use 
Coder#structuralValue and group by that, which may serialize but 
hopefully does something smarter


On Tue, Oct 27, 2020 at 8:52 AM Reuven Lax  wrote:


   Actually I believe that the Beam model does say that input elements
   should be immutable. If I remember correctly, the DirectRunner even
   validates this in unit tests, failing tests if the input elements
   have been mutated.

    On Tue, Oct 27, 2020 at 3:49 AM David Morávek  wrote:

   Hi Teodor,

   Thanks for bringing this up. This is a known, long standing
   "issue". Unfortunately there are few things we need to consider:

    - As you correctly noted, the *Beam model doesn't enforce
    immutability* of input / output elements, so this is the price.
    - We *cannot break* existing pipelines.
    - Flink Runner needs to provide the *same guarantees as the Beam
    model*.

    There are definitely some things we can do here, to make things
    faster:

    - We can try a similar approach to HadoopIO
    (HadoopInputFormatReader#isKnownImmutable), to check for known
    immutable types (KV, primitives, protobuf, other known internal
    immutable structures).
    - *If the type is immutable, we can safely reuse it.* This should
    cover most of the performance costs without breaking the
    guarantees the Beam model provides.
    - We can enable registration of custom "immutable" types via
    pipeline options? (this may be an unnecessary knob, so this
    needs a further discussion)

   WDYT?

   D.


    On Mon, Oct 26, 2020 at 6:37 PM Teodor Spæren
    wrote:

   

Re: Possible 80% reduction in overhead for flink runner, input needed

2020-10-28 Thread Maximilian Michels

Very good observation @Teodor!

Flink's default behavior is to copy elements by going through a 
serialization/deserialization roundtrip. This occurs regardless of 
whether operators are "chained" (directly pass on data without going 
through the network) or not.


This default was chosen for a reason because users tend to forget that 
you must not modify the input. It can cause very hard to detect "bugs".


PCollections are immutable but that does not mean transforms are not 
allowed to mutate inputs. Rather it means that the original element will 
not change but a copy of that element.


+1 for a flag to enable object reuse
-1 for changing the default for the above reason

Cheers,
Max

On 27.10.20 21:54, Kenneth Knowles wrote:

It seems that many correct things are said on this thread.

1. Elements of a PCollection are immutable. They should be like 
mathematical values.
2. For performance reasons, the author of a DoFn is responsible to not 
mutate input elements and also to not mutate outputs once they have been 
output.

3. The direct runner does extra work to check if your DoFn* is wrong.
4. On a production runner it is expected that serialization only occurs 
when needed for shipping data**


If the FlinkRunner is serializing things that don't have to be shipped 
that seems like a great easy win.


Kenn

*notably CombineFn has an API that is broken; only the first accumulator 
is allowed to be mutated and a runner is responsible for cloning it as 
necessary; it is expected that combining many elements will execute by 
mutating one unaliased accumulator many times
**small caveat that when doing in-memory groupings you need to use 
Coder#structuralValue and group by that, which may serialize but 
hopefully does something smarter


On Tue, Oct 27, 2020 at 8:52 AM Reuven Lax  wrote:


Actually I believe that the Beam model does say that input elements
should be immutable. If I remember correctly, the DirectRunner even
validates this in unit tests, failing tests if the input elements
have been mutated.

On Tue, Oct 27, 2020 at 3:49 AM David Morávek  wrote:

Hi Teodor,

Thanks for bringing this up. This is a known, long standing
"issue". Unfortunately there are few things we need to consider:

- As you correctly noted, the *Beam model doesn't enforce
immutability* of input / output elements, so this is the price.
- We *cannot break* existing pipelines.
- Flink Runner needs to provide the *same guarantees as the Beam
model*.

There are definitely some things we can do here, to make things
faster:

- We can try a similar approach to HadoopIO
(HadoopInputFormatReader#isKnownImmutable), to check for known
immutable types (KV, primitives, protobuf, other known internal
immutable structures).
- *If the type is immutable, we can safely reuse it.* This should
cover most of the performance costs without breaking the
guarantees the Beam model provides.
- We can enable registration of custom "immutable" types via
pipeline options? (this may be an unnecessary knob, so this
needs a further discussion)

WDYT?

D.
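
A rough sketch of the kind of allow-list check described above; the type set
and wiring are hypothetical:

import java.util.Set;

class ImmutableTypes {
  // Illustrative allow-list only; a real one would need careful review.
  private static final Set<Class<?>> KNOWN_IMMUTABLE = Set.of(
      String.class, Long.class, Integer.class, Double.class, Boolean.class);

  static boolean isKnownImmutable(Object value) {
    return value != null && KNOWN_IMMUTABLE.contains(value.getClass());
  }
}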


On Mon, Oct 26, 2020 at 6:37 PM Teodor Spæren
wrote:

Hey!

I'm a student at the University of Oslo, and I'm writing a
master thesis
about the possibility of using Beam to benchmark stream
processing
systems. An important factor in this is the overhead
associated with
using Beam over writing code for the runner directly. [1]
found that
there was a large overhead associated with using Beam, but
did not
investigate where this overhead came from. I've done
benchmarks and
confirmed the findings there, where for simple chains of
identity
operators, Beam is 43x slower than the Flink equivalent.

These are very simple pipelines, with custom sources that
just output a
series of integers. By profiling I've found that most of the
overhead
comes from serializing and deserializing. Specifically, the way
TypeSerializer [2] is implemented in [3], where each object is
serialized and then deserialized between every operator.
Looking into
the semantics of Beam, no operator should change the input,
so we don't
need to do a copy here. The function in [3] could
potentially be changed
to a single `return` statement.

Doing this removes 80% of the overhead in my tests. This is
a very
synthetic example, but it's a low hanging fruit and