gmarendaz opened a new issue, #35979:
URL: https://github.com/apache/airflow/issues/35979
### Apache Airflow version
2.7.3
### What happened
I am in the middle of migrating Apache Airflow from a Windows WSL environment to a native Ubuntu environment.
The DAG was working correctly on Windows WSL, but fails on Ubuntu.
### What you think should happen instead
The code should work as expected because the file and DAG structure did not
change.
### How to reproduce
The problem can't be reproduced elsewhere, as it only occurs in my environment.
### Operating System
Ubuntu 22.04.3 LTS
### Versions of Apache Airflow Providers
ai-operator @
file:///home/apache/airflow/packages/ai_package/dist/ai_operator-0.0.0-py3-none-any.whl
aiohttp==3.8.6
aiosignal==1.3.1
alembic==1.12.1
annotated-types==0.6.0
anyio==4.0.0
apache-airflow==2.7.3
apache-airflow-providers-common-sql==1.8.0
apache-airflow-providers-ftp==3.6.0
apache-airflow-providers-http==4.6.0
apache-airflow-providers-imap==3.4.0
apache-airflow-providers-mysql==5.2.1
apache-airflow-providers-sqlite==3.5.0
apispec==6.3.0
argcomplete==3.1.3
asgiref==3.7.2
async-timeout==4.0.3
attrs==23.1.0
Automat==20.2.0
Babel==2.13.1
backoff==1.10.0
bcrypt==3.2.0
blinker==1.6.3
cachelib==0.9.0
cattrs==23.1.2
certifi==2023.7.22
cffi==1.16.0
chardet==4.0.0
charset-normalizer==3.3.2
click==8.1.7
clickclick==20.10.2
cloud-init==23.2.2
cmake==3.27.1
colorama==0.4.6
colorlog==4.8.0
command-not-found==0.3
configobj==5.0.6
ConfigUpdater==3.1.1
connexion==2.14.2
constantly==15.1.0
contourpy==1.1.0
cron-descriptor==1.4.0
croniter==2.0.1
cryptography==41.0.5
cycler==0.11.0
dbus-python==1.2.18
Deprecated==1.2.14
dill==0.3.1.1
distro==1.7.0
distro-info==1.1+ubuntu0.1
dnspython==2.4.2
docutils==0.20.1
email-validator==1.3.1
exceptiongroup==1.1.2
filelock==3.12.2
Flask==2.2.5
Flask-AppBuilder==4.3.6
Flask-Babel==2.0.0
Flask-Caching==2.1.0
Flask-JWT-Extended==4.5.3
Flask-Limiter==3.5.0
Flask-Login==0.6.3
Flask-Session==0.5.0
Flask-SQLAlchemy==2.5.1
Flask-WTF==1.2.1
fonttools==4.42.0
frozenlist==1.4.0
gevent==23.7.0
google-re2==1.1
googleapis-common-protos==1.61.0
graphviz==0.20.1
greenlet==3.0.1
grpcio==1.59.2
gunicorn==21.2.0
h11==0.14.0
httpcore==0.16.3
httplib2==0.20.2
httpx==0.23.3
hyperlink==21.0.0
idna==3.4
importlib-metadata==6.8.0
importlib-resources==6.1.0
imutils==0.5.4
incremental==21.3.0
inflection==0.5.1
itsdangerous==2.1.2
jeepney==0.7.1
Jinja2==3.1.2
jsonpatch==1.32
jsonpointer==2.0
jsonschema==4.19.2
jsonschema-specifications==2023.7.1
keyring==23.5.0
kiwisolver==1.4.4
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy-object-proxy==1.9.0
limits==3.6.0
linkify-it-py==2.0.2
lit==16.0.6
lockfile==0.12.2
Mako==1.2.4
Markdown==3.5.1
markdown-it-py==3.0.0
MarkupSafe==2.1.3
marshmallow==3.20.1
marshmallow-enum==1.5.1
marshmallow-oneofschema==3.0.1
marshmallow-sqlalchemy==0.26.1
matplotlib==3.7.2
mdit-py-plugins==0.4.0
mdurl==0.1.2
more-itertools==8.10.0
mpmath==1.3.0
multidict==6.0.4
mutils==1.0.5
mysql-connector-python==8.1.0
mysqlclient==2.1.1
netifaces==0.11.0
networkx==3.1
numpy==1.25.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.0
opencv-python==4.8.0.76
opentelemetry-api==1.20.0
opentelemetry-exporter-otlp==1.20.0
opentelemetry-exporter-otlp-proto-common==1.20.0
opentelemetry-exporter-otlp-proto-grpc==1.20.0
opentelemetry-exporter-otlp-proto-http==1.20.0
opentelemetry-proto==1.20.0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
ordered-set==4.1.0
packaging==23.2
pandas==2.0.3
pathspec==0.11.2
pendulum==2.1.2
pexpect==4.8.0
Pillow==10.0.0
pluggy==1.3.0
prison==0.2.1
protobuf==4.24.4
psutil==5.9.6
ptyprocess==0.7.0
pyasn1==0.4.8
pyasn1-modules==0.2.1
pycparser==2.21
pydantic==2.4.2
pydantic_core==2.10.1
Pygments==2.16.1
PyGObject==3.42.1
PyHamcrest==2.0.2
PyJWT==2.8.0
pyOpenSSL==21.0.0
pyparsing==3.0.9
pyrsistent==0.18.1
pyserial==3.5
python-apt==2.4.0+ubuntu2
python-daemon==3.0.1
python-dateutil==2.8.2
python-debian==0.1.43+ubuntu1.1
python-magic==0.4.24
python-nvd3==0.15.0
python-slugify==8.0.1
pytz==2023.3.post1
pytzdata==2020.1
PyYAML==6.0.1
referencing==0.30.2
requests==2.31.0
requests-toolbelt==1.0.0
rfc3339-validator==0.1.4
rfc3986==1.5.0
rich==13.6.0
rich-argparse==1.4.0
rpds-py==0.10.6
scipy==1.11.1
seaborn==0.12.2
SecretStorage==3.3.1
service-identity==18.1.0
setproctitle==1.3.3
six==1.16.0
sniffio==1.3.0
sos==4.5.6
SQLAlchemy==1.4.50
SQLAlchemy-JSONField==1.0.1.post0
SQLAlchemy-Utils==0.41.1
sqlparse==0.4.4
ssh-import-id==5.11
sympy==1.12
systemd-python==234
tabulate==0.9.0
tenacity==8.2.3
termcolor==2.3.0
text-unidecode==1.3
torch==2.0.1
torchvision==0.15.2
tqdm==4.66.1
triton==2.0.0
Twisted==22.1.0
typing_extensions==4.8.0
tzdata==2023.3
ubuntu-advantage-tools==8001
ubuntu-drivers-common==0.0.0
uc-micro-py==1.0.2
ufw==0.36.1
unattended-upgrades==0.1
unicodecsv==0.14.1
urllib3==1.26.18
wadllib==1.3.6
Werkzeug==2.2.3
wrapt==1.15.0
WTForms==3.0.1
xkit==0.0.0
xlrd==2.0.1
yarl==1.9.2
zipp==3.17.0
zope.event==5.0
zope.interface==5.4.0
### Deployment
Virtualenv installation
### Deployment details
- Miniconda latest version
- Apache Airflow 2.7.3
### Anything else
Full traceback:

```
*** Found local files:
***   * /home/apache/airflow/logs/dag_id=MASP_Slave/run_id=manual__2023-11-28T00:01:58.619294+00:00/task_id=transform/attempt=1.log
[2023-11-30, 12:29:58 CET] {taskinstance.py:1159} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: MASP_Slave.transform manual__2023-11-28T00:01:58.619294+00:00 [queued]>
[2023-11-30, 12:29:58 CET] {taskinstance.py:1159} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: MASP_Slave.transform manual__2023-11-28T00:01:58.619294+00:00 [queued]>
[2023-11-30, 12:29:58 CET] {taskinstance.py:1361} INFO - Starting attempt 1 of 1
[2023-11-30, 12:29:58 CET] {taskinstance.py:1382} INFO - Executing <Task(PythonOperator): transform> on 2023-11-28 00:01:58.619294+00:00
[2023-11-30, 12:29:58 CET] {standard_task_runner.py:57} INFO - Started process 3168332 to run task
[2023-11-30, 12:29:58 CET] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'MASP_Slave', 'transform', 'manual__2023-11-28T00:01:58.619294+00:00', '--job-id', '890530', '--raw', '--subdir', 'DAGS_FOLDER/manufacturing/tests/MASP_Slave.py', '--cfg-path', '/tmp/tmp6usvkgvc']
[2023-11-30, 12:29:58 CET] {standard_task_runner.py:85} INFO - Job 890530: Subtask transform
[2023-11-30, 12:29:59 CET] {task_command.py:416} INFO - Running <TaskInstance: MASP_Slave.transform manual__2023-11-28T00:01:58.619294+00:00 [running]> on host clb-34a01
[2023-11-30, 12:29:59 CET] {taskinstance.py:1662} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='IMEDA' AIRFLOW_CTX_DAG_ID='MASP_Slave' AIRFLOW_CTX_TASK_ID='transform' AIRFLOW_CTX_EXECUTION_DATE='2023-11-28T00:01:58.619294+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-11-28T00:01:58.619294+00:00'
[2023-11-30, 12:29:59 CET] {taskinstance.py:1937} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/apache/.local/lib/python3.10/site-packages/airflow/models/xcom.py", line 681, in _deserialize_value
    return pickle.loads(result.value)
_pickle.UnpicklingError: pickle data was truncated

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/apache/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 192, in execute
    return_value = self.execute_callable()
  File "/home/apache/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 209, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/apache/airflow/dags/manufacturing/tests/MASP_Slave.py", line 108, in transform
    extracting_res = [output for output in task_outputs if output is not None]
  File "/home/apache/airflow/dags/manufacturing/tests/MASP_Slave.py", line 108, in <listcomp>
    extracting_res = [output for output in task_outputs if output is not None]
  File "/home/apache/.local/lib/python3.10/site-packages/airflow/models/xcom.py", line 720, in __next__
    return XCom.deserialize_value(next(self._it))
  File "/home/apache/.local/lib/python3.10/site-packages/airflow/models/xcom.py", line 693, in deserialize_value
    return BaseXCom._deserialize_value(result, False)
  File "/home/apache/.local/lib/python3.10/site-packages/airflow/models/xcom.py", line 683, in _deserialize_value
    return json.loads(result.value.decode("UTF-8"), cls=XComDecoder, object_hook=object_hook)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
[2023-11-30, 12:29:59 CET] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=MASP_Slave, task_id=transform, execution_date=20231128T000158, start_date=20231130T112958, end_date=20231130T112959
[2023-11-30, 12:29:59 CET] {standard_task_runner.py:104} ERROR - Failed to execute job 890530 for task transform ('utf-8' codec can't decode byte 0x80 in position 0: invalid start byte; 3168332)
[2023-11-30, 12:29:59 CET] {local_task_job_runner.py:228} INFO - Task exited with return code 1
[2023-11-30, 12:29:59 CET] {taskinstance.py:2778} INFO - 0 downstream tasks scheduled from follow-on schedule check
```
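The chained exceptions in the traceback are consistent with an XCom row that was written as a pickle but is being read back by the JSON deserializer: every pickle produced with protocol 2 or higher starts with the byte 0x80 (the `PROTO` opcode), which is never a valid first byte in UTF-8. A minimal sketch in plain Python (no Airflow) reproducing that decode failure:

```python
import json
import pickle

# A pickled payload (protocol >= 2, the default since Python 3) always
# begins with the opcode 0x80 (PROTO).
payload = pickle.dumps({"col": [1, 2, 3]})
assert payload[0] == 0x80

# Airflow's JSON XCom path does result.value.decode("UTF-8") before
# json.loads; on a pickled payload that decode fails exactly as above.
try:
    json.loads(payload.decode("UTF-8"))
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
```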
The function where it fails:

```python
def transform(**kwargs):
    ti = kwargs['ti']
    task_outputs = ti.xcom_pull(task_ids=["old_extract", "new_extract"])
    extracting_res = [output for output in task_outputs if output is not None]
    df = extracting_res[0]
    df = df.rename(columns={"data_n": "info_data_n"})
    schema = _template_slave.get_schema("test_wafer")
    df = concat_time_date(df)
    df = _template_slave.filter_(df, schema)
    df = _template_slave.cast(df, schema)
    df["location"] = kwargs["dag_run"].conf['location']
    df["src_path"] = kwargs["dag_run"].conf['path']
    return df
```
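Note that in Airflow 2.7, `xcom_pull` with a list of `task_ids` returns a lazy sequence (`LazyXComAccess` in the traceback): each value is deserialized only when the iterator is advanced, which is why the failure surfaces inside the list comprehension rather than at the `xcom_pull` call itself. A pure-Python sketch of that behaviour, where `broken_deserialize` is a hypothetical stand-in for `XCom.deserialize_value`:

```python
def broken_deserialize(raw: bytes) -> str:
    # Stand-in for XCom.deserialize_value hitting a pickled row
    # with the JSON backend: decode blows up on byte 0x80.
    return raw.decode("UTF-8")

def lazy_pull(rows):
    # Stand-in for LazyXComAccess: deserialization happens per item,
    # only as the generator is consumed.
    return (broken_deserialize(r) for r in rows)

rows = [b'\x80\x05rest-of-pickle']   # pickle-style payload; 0x80 is invalid UTF-8
outputs = lazy_pull(rows)            # no error yet: nothing deserialized so far
try:
    results = [o for o in outputs if o is not None]   # error surfaces here
except UnicodeDecodeError as exc:
    print("failed during iteration:", exc)
```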
The function where the data comes from:

```python
def old_extract(**kwargs):
    path = kwargs["dag_run"].conf['path']
    header_size = _template_slave.header(path, "Puce N°")
    raw_df = pd.read_csv(path, header=header_size, sep="\t",
                         encoding='iso-8859-1', engine='python')
    df = raw_df[raw_df.count(axis=1) > 6][1:]
    df = get_header_fields(path, df, comment="")
    str_columns = [col for col in df.columns if isinstance(df[col], str)]
    df[str_columns] = df[str_columns].rename(columns=str.lower)\
        .rename(columns=_template_slave.remove_accents)\
        .rename(columns=_template_slave.remove_special_characters)
    return df
```
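One plausible explanation (an assumption on my part, not confirmed by the logs): `old_extract` returns a pandas DataFrame, which can only travel through XCom when pickling is enabled, and the first traceback shows Airflow did attempt `pickle.loads` before falling back to JSON. If the old WSL deployment had XCom pickling enabled and the stored rows were carried over to the Ubuntu database truncated or re-encoded, reads would fail exactly like this. For reference, the setting in question (Airflow 2.7):

```shell
# Equivalent to [core] enable_xcom_pickling = True in airflow.cfg —
# assumed to match the old WSL environment, to be verified on both hosts.
export AIRFLOW__CORE__ENABLE_XCOM_PICKLING=True
```

Comparing this value between the two environments, and clearing the stale XCom rows for this DAG, would narrow down whether the data or the configuration is at fault.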
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)