iblaine opened a new pull request #17628:
URL: https://github.com/apache/airflow/pull/17628


   Improves get_pandas_df() in HiveServer2Hook by properly adding columns when 
an empty dataframe is encountered.
   
   Currently, get_pandas_df() in the Hive hooks first creates a dataframe from 
the query results and then assigns the column names in a separate step. pandas 
raises a ValueError when column names are assigned to an empty dataframe that 
has no columns. Passing the column names at the point where the dataframe is 
created avoids the exception on empty results.
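
A minimal sketch of the difference, using plain pandas and no Hive connection. The `res` dict is a hypothetical stand-in for the hook's query result (a `header` list of column descriptions and a `data` list of rows), with an invented column name; it illustrates the idea and is not the actual diff in this PR.

```python
import pandas as pd

# Hypothetical stand-in for an empty query result: one column, zero rows.
res = {"header": [("name", "STRING_TYPE")], "data": []}
column_names = [c[0] for c in res["header"]]

# Two-step approach: build the frame first, then assign the labels.
# With no rows, the frame has zero columns, so the assignment fails.
df_two_step = pd.DataFrame(res["data"])
try:
    df_two_step.columns = column_names
    two_step_error = None
except ValueError as exc:
    two_step_error = exc

# One-step approach: pass the labels when the frame is constructed.
df_one_step = pd.DataFrame(res["data"], columns=column_names)

print(two_step_error)
print(list(df_one_step.columns), len(df_one_step.index))
```

With the one-step form, an empty result still yields a dataframe that carries the expected column names, so downstream code that selects columns by name keeps working.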
   
   Current behavior when using get_pandas_df() to read an empty table:
   ```
   hh = HiveServer2Hook()
   sql = "SELECT * FROM <table> WHERE 1=0"
   df = hh.get_pandas_df(sql)
   
   [2021-08-15 21:10:15,282] {{hive.py:449}} INFO - SELECT * FROM <table> WHERE 1=0
   Traceback (most recent call last):
     File "<stdin>", line 2, in <module>
     File "/venv/lib/python3.7/site-packages/airflow/providers/apache/hive/hooks/hive.py", line 1073, in get_pandas_df
       df.columns = [c[0] for c in res['header']]
     File "/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5154, in __setattr__
       return object.__setattr__(self, name, value)
     File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
     File "/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 564, in _set_axis
       self._mgr.set_axis(axis, labels)
     File "/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 227, in set_axis
       f"Length mismatch: Expected axis has {old_len} elements, new "
   ValueError: Length mismatch: Expected axis has 0 elements, new values have 1 elements
   ```
   
   New behavior with this PR:
   ```
   hh = HiveServer2Hook()
   sql = "SELECT * FROM <table> WHERE 1=0"
   df = hh.get_pandas_df(sql)
   len(df.index)
   0
   ```
   
   I need help testing this against the `airflow.static_babynames` Hive table, 
if that test is needed, and I would also like to know how to set that table up. 
I have tested this locally against my own Hive server, and it works as expected.
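
One possible way to cover the empty-result case without a live Hive table is to mock the method that fetches results. The sketch below uses a hypothetical `StandInHook` class in place of the real `HiveServer2Hook`, with invented column names and table name, so it shows only the shape such a test could take.

```python
from unittest import mock

import pandas as pd


class StandInHook:
    """Hypothetical stand-in for HiveServer2Hook; not the real Airflow class."""

    def get_results(self, sql):
        # The real hook would contact a Hive server here.
        raise RuntimeError("no Hive server available in this sketch")

    def get_pandas_df(self, sql):
        res = self.get_results(sql)
        # Column labels are supplied when the frame is constructed.
        return pd.DataFrame(res["data"], columns=[c[0] for c in res["header"]])


def check_empty_result_keeps_columns():
    # Simulated result of a query that matches no rows.
    empty = {"header": [("name",), ("count",)], "data": []}
    hook = StandInHook()
    with mock.patch.object(hook, "get_results", return_value=empty):
        df = hook.get_pandas_df("SELECT * FROM t WHERE 1=0")
    assert list(df.columns) == ["name", "count"]
    assert len(df.index) == 0
    return df


df = check_empty_result_keeps_columns()
```

Mocking the fetch step keeps the test independent of any particular table, at the cost of not exercising the real connection path.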
   
   



