iblaine opened a new pull request #17628:
URL: https://github.com/apache/airflow/pull/17628
Improves get_pandas_df() in HiveServer2Hook by properly adding columns when
an empty dataframe is encountered.
Currently, get_pandas_df() in the hive hooks first creates a dataframe and
then assigns column names to the existing dataframe object. pandas raises a
ValueError when column names are assigned to an empty dataframe that has no
columns. By setting the columns at the step where the dataframe is created,
we avoid the exception on empty dataframes.
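To illustrate the difference outside of Airflow, here is a minimal pandas-only sketch. The `res` dict is a hypothetical stand-in for the hook's query result (the type strings and column names are made up for illustration); only the `[c[0] for c in res['header']]` expression is taken from the traceback in this PR.

```python
import pandas as pd

# Hypothetical stand-in for the hook's result: each header entry is a tuple
# whose first element is the column name; `data` holds the rows (none here).
res = {"header": [("id", "INT_TYPE"), ("name", "STRING_TYPE")], "data": []}
column_names = [c[0] for c in res["header"]]

# Old approach: build the frame first, then assign the column labels.
# With no rows the frame has zero columns, so the assignment fails.
df_old = pd.DataFrame(res["data"])
try:
    df_old.columns = column_names
    old_error = None
except ValueError as exc:
    old_error = str(exc)

# Approach described in this PR: pass the labels when the frame is created.
df_new = pd.DataFrame(res["data"], columns=column_names)

print(old_error)
print(list(df_new.columns), len(df_new.index))
```

Passing `columns=` at construction behaves the same for non-empty results, so the change should only affect the empty case.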
Current behavior when using get_pandas_df() to read an empty table:
```
hh = HiveServer2Hook()
sql = "SELECT * FROM <table> WHERE 1=0"
df = hh.get_pandas_df(sql)
[2021-08-15 21:10:15,282] {{hive.py:449}} INFO - SELECT * FROM <table> WHERE 1=0
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/venv/lib/python3.7/site-packages/airflow/providers/apache/hive/hooks/hive.py", line 1073, in get_pandas_df
    df.columns = [c[0] for c in res['header']]
  File "/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5154, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 564, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 227, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 0 elements, new values have 1 elements
```
New behavior with this PR:
```
hh = HiveServer2Hook()
sql = "SELECT * FROM <table> WHERE 1=0"
df = hh.get_pandas_df(sql)
len(df.index)
0
```
I need help testing this against the `airflow.static_babynames` hive table,
if that test is needed. I'm also curious how to set that up. I have tested
this locally against my own hive server and it works as expected.