Repository: incubator-airflow
Updated Branches:
  refs/heads/master 9f2c16a0a -> 4dade6db3

[AIRFLOW-1714] Fix misspelling: s/seperate/separate/

Closes #2688 from wrp/separate


Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/4dade6db
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/4dade6db
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/4dade6db

Branch: refs/heads/master
Commit: 4dade6db3d2dc03f9f44dc080acae37ebf03a4fb
Parents: 9f2c16a
Author: William Pursell <[email protected]>
Authored: Fri Oct 13 10:46:57 2017 -0700
Committer: Chris Riccomini <[email protected]>
Committed: Fri Oct 13 10:46:57 2017 -0700

----------------------------------------------------------------------
 airflow/contrib/example_dags/example_twitter_README.md | 4 ++--
 airflow/contrib/example_dags/example_twitter_dag.py    | 2 +-
 airflow/contrib/hooks/salesforce_hook.py               | 4 ++--
 airflow/www/views.py                                   | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/4dade6db/airflow/contrib/example_dags/example_twitter_README.md
----------------------------------------------------------------------
diff --git a/airflow/contrib/example_dags/example_twitter_README.md b/airflow/contrib/example_dags/example_twitter_README.md
index 03ecc66..319eac3 100644
--- a/airflow/contrib/example_dags/example_twitter_README.md
+++ b/airflow/contrib/example_dags/example_twitter_README.md
@@ -18,7 +18,7 @@
 ***Example Structure:*** In this example dag, we are collecting tweets for four users account or twitter handle. Each twitter handle has two channels, incoming tweets and outgoing tweets. Hence, in this example, by running the fetch_tweet task, we should have eight output files. For better management, each of the eight output files should be saved with the yesterday's date (we are collecting tweets from yesterday), i.e. toTwitter_A_2016-03-21.csv. We are using three kind of operators: PythonOperator, BashOperator, and HiveOperator. However, for this example only the Python scripts are stored externally. Hence this example DAG only has the following directory structure:
-The python functions here are just placeholders. In case you are interested to actually make this DAG fully functional, first start with filling out the scripts as seperate files and importing them into the DAG with absolute or relative import. My approach was to store the retrieved data in memory using Pandas dataframe first, and then use the built in method to save the CSV file on hard-disk.
+The python functions here are just placeholders. In case you are interested to actually make this DAG fully functional, first start with filling out the scripts as separate files and importing them into the DAG with absolute or relative import. My approach was to store the retrieved data in memory using Pandas dataframe first, and then use the built in method to save the CSV file on hard-disk.
 The eight different CSV files are then put into eight different folders within HDFS. Each of the newly inserted files are then loaded into eight different external hive tables. Hive tables can be external or internal. In this case, we are inserting the data right into the table, and so we are making our tables internal. Each file is inserted into the respected Hive table named after the twitter channel, i.e. toTwitter_A or fromTwitter_A.
 It is also important to note that when we created the tables, we facilitated for partitioning by date using the variable dt and declared comma as the row deliminator. The partitioning is very handy and ensures our query execution time remains constant even with growing volume of data. As most probably these folders and hive tables doesn't exist in your system, you will get an error for these tasks within the DAG. If you rebuild a function DAG from this example, make sure those folders and hive tables exists. When you create the table, keep the consideration of table partitioning and declaring comma as the row deliminator in your mind. Furthermore, you may also need to skip headers on each read and ensure that the user under which you have Airflow running has the right permission access. Below is a sample HQL snippet on creating such table:
 ```
@@ -29,7 +29,7 @@ CREATE TABLE toTwitter_A(id BIGINT, id_str STRING
 STORED AS TEXTFILE;
 alter table toTwitter_A SET serdeproperties ('skip.header.line.count' = '1');
 ```
-When you review the code for the DAG, you will notice that these tasks are generated using for loop. These two for loops could be combined into one loop. However, in most cases, you will be running different analysis on your incoming incoming and outgoing tweets, and hence they are kept seperated in this example.
+When you review the code for the DAG, you will notice that these tasks are generated using for loop. These two for loops could be combined into one loop. However, in most cases, you will be running different analysis on your incoming incoming and outgoing tweets, and hence they are kept separated in this example.
 Final step is a running the broker script, brokerapi.py, which will run queries in Hive and store the summarized data to MySQL in our case. To connect to Hive, pyhs2 library is extremely useful and easy to use. To insert data into MySQL from Python, sqlalchemy is also a good one to use.
 I hope you find this tutorial useful. If you have question feel free to ask me on [Twitter](https://twitter.com/EkhtiarSyed) or via the live Airflow chatroom room in [Gitter](https://gitter.im/airbnb/airflow).<p>
 -Ekhtiar Syed


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/4dade6db/airflow/contrib/example_dags/example_twitter_dag.py
----------------------------------------------------------------------
diff --git a/airflow/contrib/example_dags/example_twitter_dag.py b/airflow/contrib/example_dags/example_twitter_dag.py
index d8f89d3..2205b9a 100644
--- a/airflow/contrib/example_dags/example_twitter_dag.py
+++ b/airflow/contrib/example_dags/example_twitter_dag.py
@@ -127,7 +127,7 @@ hive_to_mysql = PythonOperator(
 # csv files to HDFS. The second task loads these files from HDFS to respected Hive
 # tables. These two for loops could be combined into one loop. However, in most cases,
 # you will be running different analysis on your incoming incoming and outgoing tweets,
-# and hence they are kept seperated in this example.
+# and hence they are kept separated in this example.
 # --------------------------------------------------------------------------------
 
 from_channels = ['fromTwitter_A', 'fromTwitter_B', 'fromTwitter_C', 'fromTwitter_D']


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/4dade6db/airflow/contrib/hooks/salesforce_hook.py
----------------------------------------------------------------------
diff --git a/airflow/contrib/hooks/salesforce_hook.py b/airflow/contrib/hooks/salesforce_hook.py
index 0d0a104..b82e4ca 100644
--- a/airflow/contrib/hooks/salesforce_hook.py
+++ b/airflow/contrib/hooks/salesforce_hook.py
@@ -130,7 +130,7 @@ class SalesforceHook(BaseHook, LoggingMixin):
         return [f['name'] for f in desc['fields']]
 
     def _build_field_list(self, fields):
-        # join all of the fields in a comma seperated list
+        # join all of the fields in a comma separated list
         return ",".join(fields)
 
     def get_object_from_salesforce(self, obj, fields):
@@ -204,7 +204,7 @@ class SalesforceHook(BaseHook, LoggingMixin):
         Acceptable formats are:
             - csv:
-                comma-seperated-values file. This is the default format.
+                comma-separated-values file. This is the default format.
             - json:
                 JSON array. Each element in the array is a different row.
             - ndjson:


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/4dade6db/airflow/www/views.py
----------------------------------------------------------------------
diff --git a/airflow/www/views.py b/airflow/www/views.py
index 81ee61f..9ae93f6 100644
--- a/airflow/www/views.py
+++ b/airflow/www/views.py
@@ -2612,7 +2612,7 @@ class ConnectionModelView(wwwutils.SuperUserMixin, AirflowModelView):
         'extra__google_cloud_platform__project': StringField('Project Id'),
         'extra__google_cloud_platform__key_path': StringField('Keyfile Path'),
         'extra__google_cloud_platform__keyfile_dict': PasswordField('Keyfile JSON'),
-        'extra__google_cloud_platform__scope': StringField('Scopes (comma seperated)'),
+        'extra__google_cloud_platform__scope': StringField('Scopes (comma separated)'),
     }
     form_choices = {
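
For readers of the README text corrected above: it describes generating the per-channel HDFS-put and Hive-load tasks in for loops. Below is a minimal, hypothetical sketch of that pattern. It is not the code from this commit; only `from_channels` appears in the diff, while `to_channels`, the DAG id, task ids, file paths, and commands are illustrative assumptions, using Airflow 1.x-era operator imports.

```python
# Illustrative sketch only: per-channel task generation in a loop.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.hive_operator import HiveOperator

dag = DAG('example_twitter_sketch',
          start_date=datetime(2016, 3, 21),
          schedule_interval='@daily')

# `from_channels` mirrors the list visible in the diff; `to_channels` is assumed.
from_channels = ['fromTwitter_A', 'fromTwitter_B', 'fromTwitter_C', 'fromTwitter_D']
to_channels = ['toTwitter_A', 'toTwitter_B', 'toTwitter_C', 'toTwitter_D']

# The example DAG keeps two separate loops so incoming and outgoing tweets can get
# different downstream analysis; a single combined loop is shown here for brevity.
for channel in from_channels + to_channels:
    # Put the CSV for the execution date (yesterday's tweets) into the channel's HDFS folder.
    put_to_hdfs = BashOperator(
        task_id='put_{}_to_hdfs'.format(channel),
        bash_command=('hdfs dfs -put -f /tmp/{ch}_{{{{ ds }}}}.csv '
                      '/user/hive/{ch}/'.format(ch=channel)),
        dag=dag)

    # Load the file into the Hive table named after the channel, partitioned by dt.
    load_to_hive = HiveOperator(
        task_id='load_{}_to_hive'.format(channel),
        hql=("LOAD DATA INPATH '/user/hive/{ch}/{ch}_{{{{ ds }}}}.csv' "
             "INTO TABLE {ch} PARTITION (dt='{{{{ ds }}}}')".format(ch=channel)),
        dag=dag)

    put_to_hdfs >> load_to_hive
```

Keeping the incoming and outgoing loops separate, as the example DAG does, makes it straightforward to attach different downstream analysis to each direction.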
