[ 
https://issues.apache.org/jira/browse/AIRFLOW-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158839#comment-16158839
 ] 

Siddharth commented on AIRFLOW-1560:
------------------------------------

Something to think about - Performance
I think the general approach of reading all the data from the source and
then transferring it to the destination is not ideal (for big data there are
potential performance issues). When I started my initial design I categorized
it into two use cases: a) small data, b) big data. The operators in the
Airflow code right now mostly handle small data (so does this PR). My approach
would be to start with this operator (small data) and then handle the big
data use case if we see performance issues. I have some thoughts on how we
can handle big data operations (using a batching mechanism). For instance:

1. Fetch all distinct ids (the primary key) from the source.
2. Create N buckets and assign a bucket number to every id.
3. For each bucket: fetch that slice of data from the source and insert it
   into the destination.

I plan to handle this in a subsequent PR; a rough sketch follows below.
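
A minimal sketch of that bucketed transfer, just to make the idea concrete.
The source_hook/dynamo_hook objects, the source_table query, and the
write_batch_data() call are illustrative assumptions, not the final
interface:

N_BUCKETS = 16

def transfer_in_buckets(source_hook, dynamo_hook, columns):
    # 1. Fetch all distinct primary keys from the source.
    ids = [row[0] for row in source_hook.get_records(
        "SELECT DISTINCT id FROM source_table")]

    # 2. Assign every id to one of N buckets (round-robin).
    buckets = [[] for _ in range(N_BUCKETS)]
    for i, _id in enumerate(ids):
        buckets[i % N_BUCKETS].append(_id)

    # 3. For each bucket, fetch only that slice from the source and
    #    insert it into the destination as one batch.
    for bucket in buckets:
        if not bucket:
            continue
        id_list = ", ".join("'%s'" % _id for _id in bucket)
        rows = source_hook.get_records(
            "SELECT %s FROM source_table WHERE id IN (%s)"
            % (", ".join(columns), id_list))
        dynamo_hook.write_batch_data(
            [dict(zip(columns, row)) for row in rows])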

> Add AWS DynamoDB hook for inserting batch items
> -----------------------------------------------
>
>                 Key: AIRFLOW-1560
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1560
>             Project: Apache Airflow
>          Issue Type: New Feature
>          Components: aws, boto3, hooks
>            Reporter: Siddharth
>            Assignee: Siddharth
>
> The PR addresses Airflow integration with AWS DynamoDB.
> Currently there is no hook to interact with DynamoDB for reading or writing
> items (single or batch insertions). To get started, we want to push data
> into DynamoDB using Airflow jobs (scheduled daily). The idea is to read
> aggregates from Hive and push them into DynamoDB (the write-data job will
> run every day to make this happen). First we want to create the DynamoDB
> hook (which this PR addresses) and then create an operator to move data
> from Hive to DynamoDB (a Hive-to-DynamoDB transfer operator has been added).
> I noticed that Airflow currently has AWS_HOOK (the parent hook for
> connecting to AWS using credentials stored in configs). It has a function
> to connect to AWS objects using the Client API
> (http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#client),
> which is specific to EMR_HOOK. But for inserting data we can use the
> DynamoDB Resource API
> (http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#service-resource),
> which provides higher-level abstractions for inserting data into DynamoDB.
> A good question to ask is: what is the difference between a client and a
> resource, and why use one over the other? "Resources are higher-level
> abstraction than the raw, low-level calls made by service clients. They
> can't do anything the clients can't do, but in many cases they are nicer to
> use. The downside is that they don't always support 100% of the features of
> a service." (http://boto3.readthedocs.io/en/latest/guide/resources.html)
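> To make the client-vs-resource distinction concrete, here is a minimal
> boto3 sketch of a batch insert through the Resource API (the table name,
> region, and items are placeholders):
>
> import boto3
>
> # The Resource API exposes an object-oriented Table abstraction.
> dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
> table = dynamodb.Table('my_table')  # placeholder table name
>
> items = [{'id': str(i), 'value': i} for i in range(100)]  # placeholder items
>
> # batch_writer() buffers put_item() calls into BatchWriteItem requests
> # and automatically resends any unprocessed items.
> with table.batch_writer() as batch:
>     for item in items:
>         batch.put_item(Item=item)
>
> With the low-level client we would have to assemble the BatchWriteItem
> request format and handle UnprocessedItems ourselves, which is exactly the
> boilerplate the resource layer hides.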


