As Deng mentioned, consider combining the operators.

The Airflow documentation used to say, “if you need to use data between tasks, 
consider combining them into a single operator. But if you must have separate 
tasks, there is xcom.”
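
For the XCom route with small values, a minimal sketch (Airflow 1.10-style
PythonOperator; the DAG and task names here are made up, not from the docs):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def produce(**context):
    # push something small (a count, a path, an id) - not the DataFrame itself
    context["ti"].xcom_push(key="row_count", value=42)

def consume(**context):
    row_count = context["ti"].xcom_pull(task_ids="produce", key="row_count")
    print("rows:", row_count)

with DAG("xcom_example", start_date=datetime(2019, 12, 1),
         schedule_interval=None) as dag:
    t1 = PythonOperator(task_id="produce", python_callable=produce,
                        provide_context=True)
    t2 = PythonOperator(task_id="consume", python_callable=consume,
                        provide_context=True)
    t1 >> t2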

Sent from my iPhone


> On 26 Dec 2019, at 4:12 am, Anton Zayniev <[email protected]> wrote:
> 
> Maybe the simplest solution would be to generate a temp CSV file from
> pandas and pass its path through XCom to the next task. To make it
> idempotent you can generate the filename dynamically to avoid collisions.
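> 
> Roughly like this, assuming the callables run in PythonOperators with
> provide_context=True (the path and key names are just illustrative):
> 
> import pandas as pd
> 
> def write_frame(**context):
>     df = pd.DataFrame({"a": [1, 2, 3]})
>     # name the file after the execution date so a re-run of the same
>     # task instance overwrites its own file instead of colliding
>     path = "/tmp/my_dag_write_frame_{}.csv".format(context["ds_nodash"])
>     df.to_csv(path, index=False)
>     context["ti"].xcom_push(key="csv_path", value=path)
> 
> def read_frame(**context):
>     path = context["ti"].xcom_pull(task_ids="write_frame", key="csv_path")
>     df = pd.read_csv(path)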
> 
>> On Wed, Dec 25, 2019, 16:55 Jarek Potiuk <[email protected]> wrote:
>> 
>> I think it really depends on what kind of data it is, what size, how
>> frequently you are going to use it and what the usage pattern will be. It's
>> best to make a conscious choice based on knowing the options you have :).
>> 
>> There are a number of options on top of the ones mentioned above. From what
>> I hear, Avro is becoming more and more popular - most of the services (like
>> BQ and others) support it. Parquet is also an interesting one and is
>> natively supported by pandas.
>> 
>> There are some converters that can be used to convert between the different
>> formats - for example https://github.com/ynqa/pandavro for pandas<>Avro, or
>> the "to_parquet" method built into pandas itself:
>> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
>> Avro is record based (like CSV) with nested data capability, whereas Parquet
>> is column based (and the set of columns can change over time).
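>> 
>> For instance, a quick sketch of both conversions (needs the pandavro
>> package for Avro, and pyarrow or fastparquet for Parquet; the file paths
>> are placeholders):
>> 
>> import pandas as pd
>> import pandavro as pdx
>> 
>> df = pd.DataFrame({"city": ["Warsaw", "Oslo"], "temp": [2.5, -1.0]})
>> 
>> # pandas <-> Avro via pandavro
>> pdx.to_avro("/tmp/example.avro", df)
>> df_from_avro = pdx.from_avro("/tmp/example.avro")
>> 
>> # pandas <-> Parquet via the methods built into pandas
>> df.to_parquet("/tmp/example.parquet")
>> df_from_parquet = pd.read_parquet("/tmp/example.parquet")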
>> 
>> But those are just a few examples and it's up to you to choose the right
>> approach, so here are some articles to explore:
>> 
>>   - A nice comparison/benchmark of different formats for pandas
>>     serialisation:
>>     https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
>>   - A nice explanation on SO of the benefits of using Parquet:
>>     https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
>>   - And finally a very nice article describing different types of file
>>     formats (record, column, nested, hierarchical, array, model...),
>>     including comparisons and properties of each type:
>>     https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
>> 
>> 
>> J.
>> 
>> 
>> 
>> 
>> On Tue, Dec 24, 2019 at 10:50 AM Deng Xiaodong <[email protected]> wrote:
>> 
>>> Yep, exactly what I suggested below.
>>> 
>>> In terms of format, Feather (suggested by Robin below) should be favoured
>>> over .csv given that it persists the schema as well.
>>> 
>>> 
>>> XD
>>> 
>>> On Tue, Dec 24, 2019 at 17:44 Tomasz Urbaszek <[email protected]> wrote:
>>> 
>>>> Personally I would use the .csv format and store the file in an S3/GCS
>>>> bucket. XCom is meant to store only small amounts of data.
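>>>> 
>>>> Roughly like this (assuming s3fs, or gcsfs for gs:// paths, is installed
>>>> so pandas can talk to the bucket directly; the bucket/key below are
>>>> placeholders):
>>>> 
>>>> import pandas as pd
>>>> 
>>>> uri = "s3://my-bucket/my_dag/2019-12-24/frame.csv"
>>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>>> df.to_csv(uri, index=False)     # task 1: upload
>>>> df2 = pd.read_csv(uri)          # task 2: download (URI shared via XCom)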
>>>> 
>>>> T.
>>>> 
>>>> On Tue, Dec 24, 2019 at 10:33 AM Robin Edwards <[email protected]> wrote:
>>>> 
>>>>> Feather is probably a good option for data frames:
>>>>> 
>>>>> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_feather.html
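>>>>> 
>>>>> A minimal round trip as a sketch (needs pyarrow; the path is just an
>>>>> example):
>>>>> 
>>>>> import pandas as pd
>>>>> 
>>>>> df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
>>>>> df.to_feather("/tmp/frame.feather")          # task 1: write
>>>>> df2 = pd.read_feather("/tmp/frame.feather")  # task 2: read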
>>>>> 
>>>>> R
>>>>> 
>>>>> On Tue, 24 Dec 2019 at 07:52, Deng Xiaodong <[email protected]> wrote:
>>>>>> 
>>>>>> Hi David.
>>>>>> 
>>>>>> The only “out of the box” way to share data/information between tasks
>>>>>> is XCom (
>>>>>> https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#xcoms
>>>>>> ).
>>>>>> 
>>>>>> For your case, the quick suggestions I can share are:
>>>>>> 
>>>>>> - either merging your tasks
>>>>>> - or persisting your pandas DataFrame somewhere and then loading it in
>>>>>>   your 2nd task (e.g. using pickle)
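>>>>>> 
>>>>>> For the second option, a tiny sketch of the pickle route (the path is
>>>>>> just illustrative):
>>>>>> 
>>>>>> import pandas as pd
>>>>>> 
>>>>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>>>>> df.to_pickle("/tmp/frame.pkl")        # 1st task: persist
>>>>>> df2 = pd.read_pickle("/tmp/frame.pkl")  # 2nd task: load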
>>>>>> 
>>>>>> 
>>>>>> XD
>>>>>> 
>>>>>> On Tue, Dec 24, 2019 at 15:00 David Muñoz <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Excuse me, I am new to this and maybe this topic has already been
>>>>>>> discussed.
>>>>>>> 
>>>>>>> I would like to know if there is a way to "share/pass" pandas
>>>>>>> DataFrames between tasks in Airflow.
>>>>>>> 
>>>>>>> Any help would be appreciated.
>>>>>>> 
>>>>>>> Thank you!!!
>>>>>>> 
>>>>>>> David.
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> 
>>>> Tomasz Urbaszek
>>>> Polidea <https://www.polidea.com/> | Software Engineer
>>>> 
>>>> M: +48 505 628 493 <+48505628493>
>>>> E: [email protected] <[email protected]>
>>>> 
>>>> Unique Tech
>>>> Check out our projects! <https://www.polidea.com/our-work>
>>>> 
>>> 
>> 
>> 
>> --
>> 
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>> 
>> M: +48 660 796 129 <+48660796129>
>> [image: Polidea] <https://www.polidea.com/>
>> 
