*TL;DR; I am asking for opinions on some of the changes required to
introduce implicit native package support. This is the easiest way to make
backports of master providers packages to 1.10.x. *

NOTE!: @Ash Berlin-Taylor <[email protected]> @Fokko Driesprong
<[email protected]> @Felix Uellendall <[email protected]> @Kamil Breguła
<[email protected]> @Kaxil Naik <[email protected]>
@[email protected] <[email protected]> @Philippe
Gagnon <[email protected]>  - there is one point 1a) below that I would
like to get your input.

As AIP-21 (import paths) import path move is done we have an important step
to complete - we should switch to implicit "native" packages. It boils down
to removal of empty (only with comment licences) __init__.py files and
relying on python 3.3+ capability of finding packages without having to add
__init_.py files. This is needed in order to get backported packaged to be
prepared for 1.10.* series. I worked on it during the last few days - to
make sure this can be done and I am close to have it. Close enough to know
I can solve all the remaining problems. I wanted to check that we can do it
for almost all the packages. I already solved most of the initial
problems but I have some places where I think I need community opinion on
the way we should solve them:

The PR that has the changes is here:
https://github.com/apache/airflow/pull/7279/files - it's not 100% ready yet
and I want to split it into a few separate PRs so do not comment it there
yet - I extracted the most important changes here and wanted to ask your
opinion where I have doubts:

*1) Module names identical to some top-level package names (for example
email.py):*

Some of the module names (for example email.py) has to be changed. The
problem with implicit packages is that if local module is named the same as
the top level package, importing the top-level package from it does not
work:

email.py:
...
import email <- this won't work - module imports itself.

For now in my PR i renamed the modules to be airflow_<old_name>.py or
<old_name>_utils.py : There are  such modules (mostly hooks):

   - email -> email_utils, (utils)
   - cassandra -> airflow_cassandra
   - cloudant -> airflow_cloudant
   - dataproc -> airflow_dataproc
   - grpc -> airlfow_grpc
   - hdfs -> airflow_hdfs
   - jira -> airflow_jira
   - kerberos -> kerberos_security
   - redis -> airflow_redis
   - sendgrid -> airflow_sendgrid
   - snowflake -> airflow_snowflake
   - winrm -> airflow_winrm (operator)
   - datadog -> airflow_datatog(sensor)
   - sqlalchemy -> sqlalchemy_utils (utils)

I can also change it to grpc_hooks.py, datadog_sensors.py etc. Or maybe
someone knows an easy way how to keep the module name and implicit
packages?

I guess this is not a big problem to change it this way - we anyhow have
changed import paths for those. The only two problems with this are:
a) email was used in "email_backend" configuration : email_backend =
airflow.utils.email.send_email_smtp
but we can solve it by handling that as special case, raising deprecation
warning and converting it to  airflow.utils.email_utils.send_email_smtp
b) we introduce slight inconsistency in hook/operator/sensor names.

*2) Mypy fails when checking individual (duplicated) module names*

Mypy does not work well with duplicate module names in implicit packages (
https://github.com/pre-commit/mirrors-mypy/issues/5) and with AIP-21 we
have plenty of them. Repeating modules (http.py in both operators and hooks
for example) cause it to fail in case they are provided as parameters (in
case of incremental mypy check in pre-commits. I had to change it then so
that mypy always check 'airflow' and 'test' packages instead - which makes
pre-commit check slightly slower (few seconds).

*3) _hooks.py/_operators.py/_sensors.py
<http://hooks.py/_operators.py/_sensors.py> suffix: Some thought resulting
from above 1) and 2): *

I did not want to open pandora's box again but I think removal of _hooks,
_operators, _sensors from the module name might not have been the best
decision. While it removes some duplication in module name, it actually
introduces duplicate module names and (as it is with mypy) it might be a
problem in the future. I think now we should change it back to add the
suffixes. If we want to reverse it - this is the last moment we can do it
(and we still can easily). I would love some quick opinion/voting for that
- especially those who voted in AIP-21 but others are welcome as well: @Ash
Berlin-Taylor <[email protected]> @Fokko Driesprong <[email protected]> @Felix
Uellendall <[email protected]> @Kamil Breguła <[email protected]> @Kaxil
Naik <[email protected]> @[email protected]
<[email protected]> @Philippe Gagnon <[email protected]>  -
this was "Case 2" in the original AIP-21 proposal.

WDYT?

*4) Test module names:*

I had to change all test module name for providers. For example: . Pytest
does not like repeating implicit test module names. It throws error
"duplicate test module). For example:
tests/providers/http/hooks/test_http.py →
tests/providers/http/hooks/test_http_hooks.py
<https://github.com/apache/airflow/pull/7279/files#diff-593e762ce1d79c250426a27b6fb28908>


(basically I added hooks/operators/sensors everywhere). It is easy and
fully automated and it has no impact at all on the main code base, so I
think this is a super-safe change. t The nice thing is that it makes it
easier to understand what kind of tests you are looking at immediately and.
That also hints that maybe _hooks.py, _operators.py, _sensors.py in
hooks/operators/sensors could have been a better choice (see 2 a) )

*5) Doc generation*

Autoapi doc generation has not yet released implicit package support (it is
in master only). However I managed to monkey-patch 1.2.1 (latest) version
to add support by cherry-picking the changes (there are literally few
methods changed there -
https://github.com/readthedocs/sphinx-autoapi/pull/178) and when new
version is released we will be able to get rid of monkey-patching . I am
still resolving a few import issues in that newest version but I am
confident I can fix or workaround all issues with it.
I don't expect this to be a problem - it is only in the doc generation
code, not the main application code.

I verified that pytest test discovery works fine, and that airflow can be
installed and works with implicit packages. Let me know what you think
about it and I would love to merge it as soon as possible (rebasing this
for a long time might be painful).

J.





On Sat, Nov 23, 2019 at 9:22 AM Jarek Potiuk <[email protected]>
wrote:

> I've read a bit more about PEP-0420 so some more extracted information:
>
> - implicit namespace packages are very similar to regular packages. The
> notable difference is that they can be loaded from several directories (so
> you can have modules in the same package but coming from different physical
> directories), so no __file__ is defined for the package (it is still
> defined for modules),
>
> - pytest should have no problems. It has indeed
> <https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages>
> problems with implicit packages when tests are co-located with their tested
> modules (module.py, module_test.py) but we are not in this situation.
>
> - I do not see how it could be slower - I have not found any
> benchmarking yet (but I have not found any complaints about it either). I
> looked at the loading algorithm for PEP-420 - it traverses all the
> directory structure available on the PYTHONPATH bo so does the "standard"
> mechanism. And we would not have to parse and load all the 140 licence-only
> __init__.py. The presence of module in the package is based on the presence
> of "<module>.py" in the right folder. If we would have several packages
> elsewhere that might become a bit slower, but I think reading files is so
> much slower than scanning directories that - if anything - we will be
> faster (just opening and closing those __init__.files will take quite some
> time :). We can do some benchmarking before and after to be sure.
>
> Side comment: I am not 100% sure, but after reading PEP-420  intuition
> tells me that maybe implicit packages can help in solving relative imports
> problems that some of our user raised . I saw recently several users had
> problems with relative imports used in DAGs. We are loading the DAGs in a
> specific way and in implicit packages, the __repr__ of modules parent is
> dynamic - based on the location of the file rather than based on the module
> of the file, so it's likely if we encourage dag developers also to use
> implicit packages, it might actually solve the relative imports case.
>
> J.
>
> On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <[email protected]>
> wrote:
>
>> Fine by me if it doesn’t break anything. I’m always on the side of less
>> code :).
>>
>> via Newton Mail [
>> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2
>> ]
>> On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <[email protected]>
>> wrote:
>> Probably worth it, but things to check:
>>
>> - if nose or pytest needs them under tests/ to find all the tests
>> - if not having the init files slows down pythons package importer as it
>> searches more paths?
>>
>> -a
>>
>> > On 22 Nov 2019, at 12:06, Jarek Potiuk <[email protected]>
>> wrote:
>> >
>> > There is one implementation detail in AIP-21 that I continue to have
>> > questions about, and I wanted to ask community about it.
>> >
>> > Since we moved to Python 3, we have the option of using implicit
>> namespace
>> > packages.
>> > https://www.python.org/dev/peps/pep-0420/ . We have now grouping per
>> folder
>> > for services, so it would require a lot more __init__.py files if we
>> > continue using them.
>> >
>> > The implicit naming boils down to not requiring __init__.py files if no
>> > initialisation of package is needed. We could do it consistently for all
>> > code which is "internal" to airflow. This means that most packages will
>> not
>> > have __init__.py there will be few exceptions like 'airflow',
>> > 'airflow.modules' and likely few others with the __init__.py.
>> >
>> > I did a quick check and we have only 25 _init_.py with some logic -
>> > remaining ~ 140 could be removed as they only contain comments.
>> >
>> > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
>> > 25
>> > find . -name '__init__.py' |wc -l
>> > 164
>> >
>> > Those are the files that will be left:
>> > ./tests/__init__.py
>> > ./tests/utils/log/elasticmock/__init__.py
>> > ./tests/utils/log/elasticmock/utilities/__init__.py
>> > ./tests/models/__init__.py
>> > ./tests/task/__init__.py
>> > ./airflow/sensors/__init__.py
>> > ./airflow/operators/__init__.py
>> > ./airflow/_vendor/nvd3/__init__.py
>> > ./airflow/_vendor/slugify/__init__.py
>> > ./airflow/serialization/__init__.py
>> > ./airflow/__init__.py
>> > ./airflow/models/__init__.py
>> >
>> ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
>> > ./airflow/ti_deps/deps/__init__.py
>> > ./airflow/macros/__init__.py
>> > ./airflow/executors/__init__.py
>> > ./airflow/lineage/__init__.py
>> > ./airflow/lineage/backend/__init__.py
>> > ./airflow/lineage/backend/atlas/__init__.py
>> > ./airflow/hooks/__init__.py
>> > ./airflow/task/task_runner/__init__.py
>> > ./airflow/api/__init__.py
>> > ./airflow/api/common/experimental/__init__.py
>> > ./airflow/api/client/__init__.py
>> > ./airflow/jobs/__init__.py
>> >
>> > WDYT?
>> >
>> > J
>> >
>> > --
>> >
>> > Jarek Potiuk
>> > Polidea <https://www.polidea.com/> | Principal Software Engineer
>> >
>> > M: +48 660 796 129 <+48660796129>
>> > [image: Polidea] <https://www.polidea.com/>
>
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Reply via email to