dstandish commented on PR #64571:
URL: https://github.com/apache/airflow/pull/64571#issuecomment-4577677517
Hey @Lee-W
I was a little surprised to see so many lines for this change so I asked
Claude to help me review whether there's any unnecessary complexity. Below are
its findings. They seem plausible. What do you think?
---
The core design is clean (the Window types are tiny generators,
RollupMapper.to_upstream is a clear decode→expand→encode, and the
scheduler gate comes down to one
expected.issubset(actual) check), and the test coverage on the scheduler
paths is excellent. The
HA hardening in _create_dagruns_for_partitioned_asset_dags (bulk-fetch
over the old N+1,
with_row_locks(skip_locked=True), the per-tick cap and deterministic
order_by) all looks right.
One theme I'd like to resolve before merge: a meaningful chunk of
complexity is self-induced,
concentrated in partition_mappers/temporal.py.
_compile_output_format_regex looks like more generality than the feature
needs. Only two mappers
can't round-trip through strptime: StartOfWeekMapper (%V isn't
strptime-parseable) and
StartOfQuarterMapper ({quarter} isn't a directive). The actual requirement
is "recover Y/m/d" and
"recover Y/quarter". But the decode that motivated the whole compiler ends
up discarding the token
it was built for:
# StartOfWeekMapper.decode_downstream — %V is captured but never read
return datetime(int(match["Y"]), int(match["m"]), int(match["d"]))
Both decodes collapse to a dedicated per-mapper regex:
_WEEK_RE = re.compile(r"(?P<Y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})")
_QUARTER_RE = re.compile(r"(?P<Y>\d{4})-Q(?P<quarter>[1-4])")
That drops the directive table, the {name} placeholder machinery, the
placeholder_patterns arg,
and the five compile-time ValueError branches — plus the test class that
only exists to exercise
that invented surface (test_rejects_adjacent_default_pattern_placeholders,
test_adjacent_placeholders_allowed_when_one_is_narrowed,
test_separator_between_default_placeholders_is_allowed). Roughly ~120
lines + tests, with
identical behavior for every shipped mapper. Could we go with the
dedicated regexes for now and
add generality if/when a custom mapper actually needs it?
Follow-on: if the temporal decode simplifies, the base-class guard
scaffolding in
partition_mappers/base.py — the __init_subclass__ XOR check,
expected_decoded_type, and the
runtime pairing check in RollupMapper.__init__ — largely loses its purpose
for the shipped
mappers, since it mostly exists to make the general decode/encode pair
safe. Worth reassessing
whether it belongs now or arrives with the first real custom mapper.
Question (non-blocking): the audit-log path (_record_partition_audit_log /
_record_stale_apdr_audit_log) opens independent scoped=False sessions to
survive an outer
rollback, plus a process-lifetime dedup set. For advisory Log rows where
self.log.exception(...)
already records the same thing, is the rollback-survival guarantee worth
the machinery for v1, or
would a plain log line do?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]