[jira] [Commented] (AIRFLOW-3185) Add chunking to DBAPI_hook by implementing fetchmany and pandas chunksize

Thomas Haederle (JIRA) Wed, 10 Oct 2018 15:05:42 -0700


    [ 
https://issues.apache.org/jira/browse/AIRFLOW-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645607#comment-16645607
 ]


Thomas Haederle commented on AIRFLOW-3185:
------------------------------------------

Draft code for the records method:
{code:python}
# Module-level imports the method relies on.
import sys
from contextlib import closing


def get_many_records(self, sql, parameters=None, chunksize=20,
                     iterate_singles=False):
    """
    Executes the sql and returns a generator for a set of records.

    :param sql: the sql statement to be executed (str) or a list of
        sql statements to execute
    :type sql: str or list
    :param parameters: The parameters to render the SQL query with.
    :type parameters: mapping or iterable
    :param chunksize: The number of records to fetch from the server with
        each roundtrip.
    :type chunksize: int
    :param iterate_singles: if True, yield one record at a time; otherwise
        yield lists of up to chunksize records
    :type iterate_singles: bool
    """
    if sys.version_info[0] < 3:
        sql = sql.encode('utf-8')
    with closing(self.get_conn()) as conn:
        with closing(conn.cursor()) as cur:
            if parameters is not None:
                cur.execute(sql, parameters)
            else:
                cur.execute(sql)
            while True:
                # Fetch the next batch; an empty batch means the cursor
                # is exhausted.
                results = cur.fetchmany(chunksize)
                if not results:
                    break
                if iterate_singles:
                    for result in results:
                        yield result
                else:
                    yield results
{code}
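
To try the fetchmany loop outside of Airflow, here is a minimal standalone sketch. It assumes an in-memory SQLite database in place of a real hook connection; the function name iter_records and the table t are hypothetical and exist only for this illustration.

```python
import sqlite3
from contextlib import closing


def iter_records(conn, sql, parameters=None, chunksize=20,
                 iterate_singles=False):
    """Yield rows from ``sql`` in batches fetched with cursor.fetchmany.

    Hypothetical standalone variant of the draft method: it takes an open
    DB-API connection instead of calling self.get_conn().
    """
    with closing(conn.cursor()) as cur:
        if parameters is not None:
            cur.execute(sql, parameters)
        else:
            cur.execute(sql)
        while True:
            results = cur.fetchmany(chunksize)
            if not results:
                break
            if iterate_singles:
                for result in results:
                    yield result
            else:
                yield results


# Demo data (assumption): an in-memory SQLite table with 45 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(45)])

# Chunked mode: each yielded item is a list of up to `chunksize` rows.
chunk_sizes = [
    len(chunk)
    for chunk in iter_records(conn, "SELECT id FROM t ORDER BY id")
]

# Single-row mode: each yielded item is one row tuple.
single_rows = list(
    iter_records(conn, "SELECT id FROM t WHERE id < ? ORDER BY id", (3,),
                 iterate_singles=True)
)

print(chunk_sizes)
print(single_rows)
conn.close()
```

Because the function is a generator, nothing is executed until the caller starts iterating, and the cursor stays open until the generator is exhausted or closed.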
 

> Add chunking to DBAPI_hook by implementing fetchmany and pandas chunksize
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3185
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3185
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 1.10.0
>            Reporter: Thomas Haederle
>            Assignee: Thomas Haederle
>            Priority: Minor
>              Labels: easyfix
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> DbApiHook currently implements get_records and get_pandas_df; both
> methods fetch all records into memory.
> We should implement two new methods which return a generator with a
> configurable chunksize:
> - def get_many_records(self, sql, parameters=None, chunksize=20, iterate_singles=False)
> - def get_pandas_df_chunks(self, sql, parameters=None, chunksize=20)
> This should work for all DB hooks which inherit from this class.
> We could also adapt the existing methods instead, but that could be
> problematic because the adapted methods would return a generator, whereas
> they currently return either records or dataframes.
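
One possible shape for the DataFrame variant proposed above, sketched under assumptions: it relies on the chunksize argument of pandas.read_sql, which makes pandas return an iterator of DataFrames instead of a single frame. The function name df_chunks, the SQLite connection, and the table t are hypothetical stand-ins for a hook method and a real database.

```python
import sqlite3

import pandas as pd


def df_chunks(conn, sql, parameters=None, chunksize=20):
    """Yield DataFrames holding at most ``chunksize`` rows each.

    Hypothetical standalone variant: takes an open connection rather than
    calling self.get_conn() on a hook.
    """
    # With chunksize set, read_sql returns an iterator of DataFrames.
    for df in pd.read_sql(sql, con=conn, params=parameters,
                          chunksize=chunksize):
        yield df


# Demo data (assumption): an in-memory SQLite table with 45 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(45)])
conn.commit()

frame_lengths = [
    len(df) for df in df_chunks(conn, "SELECT id FROM t ORDER BY id")
]
print(frame_lengths)
conn.close()
```

Keeping this as a separate method rather than adding a chunksize argument to get_pandas_df avoids a return type that switches between a DataFrame and a generator.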



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
