Brian Hulette created BEAM-13266:
------------------------------------
Summary: DataFrame API errors should identify culprit operation in
user code
Key: BEAM-13266
URL: https://issues.apache.org/jira/browse/BEAM-13266
Project: Beam
Issue Type: Improvement
Components: dsl-dataframe
Reporter: Brian Hulette
The DataFrame API aims to catch errors in pipeline code at pipeline
construction time as much as possible. Ideally, flawed user code will be caught
during proxy generation and bubble up an error from pandas.
However, there are edge cases where DataFrame operations validate at
construction time, but still produce errors at execution time, based on the
actual data. For example a user might try to use the modulo operator on a
string column. This is a valid operation, but it performs string
interpolation, not a modulo as the user intended.
The above situation will raise an error at execution time, but it has a very
obtuse stacktrace, with a tree of evaluate/evaluate_at calls. The culprit from
the user's code is nowhere in the astacktrace.
We should catch errors like this at execution time and add a pointer to the
line in the user's
code that created this expression. We'll likely need to add this metadata to
DataFrame expressions.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)