[ https://issues.apache.org/jira/browse/AIRFLOW-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Anand resolved AIRFLOW-2452.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request #3347
[https://github.com/apache/incubator-airflow/pull/3347]

> Document field_dict for HiveCliHook.load_file must be OrderedDict
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-2452
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2452
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: docs, Documentation, hive_hooks, hooks
>            Reporter: Kengo Seki
>            Assignee: Kengo Seki
>            Priority: Major
>             Fix For: 2.0.0
>
>
> HiveCliHook.load_file has a parameter called field_dict, which defines
> name-type pairs for columns and must be an OrderedDict. Otherwise, the
> columns can be created in a different order from the input file and users
> get unexpected results. Example:
> Given the following input file:
> {code}
> $ head /tmp/baby_names.csv
> 1880,John,0.081541,boy
> 1880,William,0.080511,boy
> 1880,James,0.050057,boy
> 1880,Charles,0.045167,boy
> 1880,George,0.043292,boy
> 1880,Frank,0.02738,boy
> 1880,Joseph,0.022229,boy
> 1880,Thomas,0.021401,boy
> 1880,Henry,0.020641,boy
> {code}
> Load the file via HiveCliHook.load_file with field_dict as a normal dict:
> {code}
> In [1]: from airflow.hooks.hive_hooks import HiveCliHook
> In [2]: hook = HiveCliHook()
> [2018-05-10 19:49:31,819] {base_hook.py:85} INFO - Using connection to: localhost
> In [3]: field_dict = {
>    ...:     "year": "INT",
>    ...:     "name": "STRING",
>    ...:     "pct": "DOUBLE",
>    ...:     "sex": "STRING",
>    ...: }
> In [4]: hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names", field_dict=field_dict, recreate=True)
> [2018-05-10 19:51:53,854] {hive_hooks.py:424} INFO - DROP TABLE IF EXISTS baby_names;
> CREATE TABLE IF NOT EXISTS baby_names (
>     sex STRING,
>     name STRING,
>     pct DOUBLE,
>     year INT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS textfile
> ;
> (snip)
> [2018-05-10 19:52:17,965] {hive_hooks.py:232} INFO - Table default.baby_names stats: [numFiles=1, numRows=0, totalSize=1289, rawDataSize=0]
> [2018-05-10 19:52:17,966] {hive_hooks.py:232} INFO - OK
> [2018-05-10 19:52:17,967] {hive_hooks.py:232} INFO - Time taken: 1.349 seconds
> {code}
> The file is loaded, but the fields in the CREATE TABLE statement are out of
> order, so the loaded data is not read back correctly from Hive (the year
> column, which receives the "boy" values, comes back as NULL):
> {code}
> hive> SELECT * FROM baby_names LIMIT 10;
> OK
> 1880    John       0.081541    NULL
> 1880    William    0.080511    NULL
> 1880    James      0.050057    NULL
> 1880    Charles    0.045167    NULL
> 1880    George     0.043292    NULL
> 1880    Frank      0.02738     NULL
> 1880    Joseph     0.022229    NULL
> 1880    Thomas     0.021401    NULL
> 1880    Henry      0.020641    NULL
> 1880    Robert     0.020404    NULL
> Time taken: 2.465 seconds, Fetched: 10 row(s)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
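
For reference, a minimal sketch of the ordered variant of the field_dict from the report. It assumes the column list is rendered by iterating over field_dict in order; render_columns is a hypothetical helper written for illustration and is not part of Airflow. The load_file call is left as a comment because it needs a working Hive connection.

```python
from collections import OrderedDict

# Column order here must match the column order in /tmp/baby_names.csv.
field_dict = OrderedDict([
    ("year", "INT"),
    ("name", "STRING"),
    ("pct", "DOUBLE"),
    ("sex", "STRING"),
])


def render_columns(fields):
    """Hypothetical helper (not Airflow code): join name-type pairs in
    iteration order, as a CREATE TABLE column list would be built."""
    return ",\n    ".join(
        "{} {}".format(name, type_) for name, type_ in fields.items()
    )


columns_ddl = render_columns(field_dict)
print("CREATE TABLE IF NOT EXISTS baby_names (\n    " + columns_ddl + ")")

# With a Hive connection available, the same mapping would be passed as in
# the report:
# from airflow.hooks.hive_hooks import HiveCliHook
# HiveCliHook().load_file(filepath="/tmp/baby_names.csv", table="baby_names",
#                         field_dict=field_dict, recreate=True)
```

Plain dicts only guarantee insertion order from Python 3.7 onward, so OrderedDict is the safe choice on older interpreters such as the ones in use when this issue was filed.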