As Deng mentioned, consider combining the operators. The Airflow documentation used to say, “if you need to use data between tasks, consider combining them into a single operator. But if you must have separate tasks, there is xcom.”
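
To make the two options concrete, here is a rough sketch, not a definitive implementation. It uses plain functions of the kind you would hand to a PythonOperator as python_callable. All names in it (build_frame, double_values, combined_task, extract_task, transform_task, run_id, out_dir) are made up for illustration. I am assuming that the string returned by the first callable is what gets pushed to XCom and pulled by the second task, and that both tasks can see the same directory.

```python
import os
import tempfile

import pandas as pd


def build_frame() -> pd.DataFrame:
    # Stand-in for whatever the first task really computes (hypothetical data).
    return pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})


def double_values(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the second task's work; returns a new frame.
    out = df.copy()
    out["value"] = out["value"] * 2
    return out


# Option 1: combine both steps in one task, so the frame never leaves memory.
def combined_task() -> int:
    result = double_values(build_frame())
    return len(result)  # return something small, not the frame itself


# Option 2: keep two tasks and pass only the file path between them.
def extract_task(run_id: str, out_dir: str) -> str:
    # The file name is derived from the run identifier (assumed unique per
    # run), so a rerun overwrites its own file and other runs do not collide.
    path = os.path.join(out_dir, f"frame_{run_id}.csv")
    build_frame().to_csv(path, index=False)
    return path  # a short string, which is what XCom is meant for


def transform_task(path: str) -> pd.DataFrame:
    # In a real DAG the path would come from XCom rather than an argument.
    return double_values(pd.read_csv(path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        frame_path = extract_task("2019-12-26", tmp)
        print(transform_task(frame_path))
        print(combined_task())
```

CSV is used here only because it needs no extra dependency; it does not keep the schema, so Feather or Parquet may be the better choice. A local directory also only works when both tasks run on the same machine or a shared filesystem, otherwise the file would have to go to something like an S3/GCS bucket.
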
> On 26 Dec 2019, at 4:12 am, Anton Zayniev <[email protected]> wrote:
>
> Maybe the simplest solution would be generating a temp csv file from
> pandas, pass its path through xcom to next task. To make it idempotent you
> can dynamically generate filename to avoid collisions.
>
>> On Wed, Dec 25, 2019, 16:55 Jarek Potiuk <[email protected]> wrote:
>>
>> I think it really depends what kind of data, what size, which frequency you
>> are going to use it for and what will be the use pattern. It's best to make
>> a conscious choice based on knowing the options you have :).
>>
>> There are a number of options on top of the mentioned above. From what I
>> hear - Avro becomes more and more popular - most of the services (like BQ
>> and others) support it. Also Parquet is an interesting one and natively
>> supported by pandas.
>>
>> There are some converters that can be used to convert between different
>> formats (for example https://github.com/ynqa/pandavro for pandas<>avro or
>> the "to_parquet" method built into pandas itself:
>> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
>> ).
>> Avro is record based (like CSV) with nested data capability, where Parquet
>> is column based (where set of columns can change over time).
>>
>> But those are just a few examples and it's up to you to choose the right
>> approach for you, so here are some articles to explore:
>>
>> - Here you can find nice comparison/benchmark of different formats for
>>   pandas serialisation:
>>   https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
>> - Also nice explanation in SO what are the benefits of using Parquet:
>>   https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
>> - And finally very nice article describing different types of file
>>   formats (record, column, nested, hierarchical, array, model...) -
>>   including comparisons and properties of each type:
>>   https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
>>
>> J.
>>
>> On Tue, Dec 24, 2019 at 10:50 AM Deng Xiaodong <[email protected]> wrote:
>>
>>> Yep, exactly what I suggested below.
>>>
>>> In terms of format, Feather (suggested by Robin below) should be favoured
>>> over .csv given it persists schema as well.
>>>
>>> XD
>>>
>>> On Tue, Dec 24, 2019 at 17:44 Tomasz Urbaszek <[email protected]> wrote:
>>>
>>>> Personally I would use a .csv format and store the file on a S3/GCS
>>>> bucket. Xcom is meant to store small amount of data.
>>>>
>>>> T.
>>>>
>>>> On Tue, Dec 24, 2019 at 10:33 AM Robin Edwards <[email protected]> wrote:
>>>>
>>>>> Feather is probably a good option for data frames:
>>>>>
>>>>> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_feather.html
>>>>>
>>>>> R
>>>>>
>>>>> On Tue, 24 Dec 2019 at 07:52, Deng Xiaodong <[email protected]> wrote:
>>>>>>
>>>>>> Hi David.
>>>>>>
>>>>>> The only “out of box” way to share data/information between tasks is
>>>>>> XCom (
>>>>>> https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#xcoms
>>>>>> ).
>>>>>>
>>>>>> For your case, the quick suggestion I can share is
>>>>>>
>>>>>> - either merging your tasks
>>>>>> - or persisting your Pandas Dataframes somewhere then load it in your
>>>>>>   2nd task (e.g. using pickle)
>>>>>>
>>>>>> XD
>>>>>>
>>>>>> On Tue, Dec 24, 2019 at 15:00 David Muñoz <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Excuse me, I am new to this and maybe this topic has already been
>>>>>>> treated.
>>>>>>>
>>>>>>> I would like to know if there is a way to "share/pass" pandas
>>>>>>> dataframes between tasks in airflow.
>>>>>>>
>>>>>>> Any help would be appreciated.
>>>>>>>
>>>>>>> Thank you!!!
>>>>>>>
>>>>>>> David.
>>>>
>>>> --
>>>>
>>>> Tomasz Urbaszek
>>>> Polidea <https://www.polidea.com/> | Software Engineer
>>
>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
