If the goal is to split the output, then `DataFrameWriter.partitionBy` should do what you need, and no additional methods are required. If not, you can also check Silex's muxPartitions implementation (see https://stackoverflow.com/a/37956034), but its applications are rather limited due to high resource usage.
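For example, a minimal sketch of the partitionBy approach (the column name
and output path are illustrative):

    // Writes one directory per distinct value of "category", e.g.
    // /tmp/out/category=a/, /tmp/out/category=b/, ...
    df.write
      .partitionBy("category")
      .parquet("/tmp/out")

Note this splits the *output* on disk; each subset can then be read back
independently, e.g. spark.read.parquet("/tmp/out/category=a").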
On Sun, 3 Feb 2019 at 15:41, Sean Owen <[email protected]> wrote:

> I don't think Spark supports this model, where N inputs depending on a
> parent are computed once at the same time. You can of course cache the
> parent and filter N times and do the same amount of work. One problem is,
> where would the N inputs live? They'd have to be stored if not used
> immediately, and presumably in any use case, only one of them would be
> used immediately. If you have a job that needs to split records of a
> parent into N subsets, and then all N subsets are used, you can do that --
> you are just transforming the parent to one child that has rows with those
> N splits of each input row, and then consuming that. See randomSplit() for
> maybe the best case, where it still produces N Datasets but can do so
> efficiently because it's just a random sample.
>
> On Sun, Feb 3, 2019 at 12:20 AM Moein Hosseini <[email protected]> wrote:
>
>> I don't consider it a method to apply filtering multiple times; instead,
>> I'd use it as a semi-action, not just a transformation. Suppose we had
>> something like map-partition that accepts multiple lambdas, each of which
>> collects rows for its own dataset (or something like that). Is that
>> possible?
>>
>> On Sat, Feb 2, 2019 at 5:59 PM Sean Owen <[email protected]> wrote:
>>
>>> I think the problem is that you can't produce multiple Datasets from one
>>> source in one operation -- consider that reproducing one of them would
>>> mean reproducing all of them. You can write a method that would do the
>>> filtering multiple times, but it wouldn't be faster. What do you have in
>>> mind that's different?
>>>
>>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <[email protected]> wrote:
>>>
>>>> I've seen many applications that need to split a dataset into multiple
>>>> datasets based on some conditions. As there is no method to do it in
>>>> one place, developers use the *filter* method multiple times. I think
>>>> it would be useful to have a method that splits a dataset based on
>>>> conditions in one iteration, something like Scala's *partition* method
>>>> (of course, Scala's partition just splits a list into two lists, but
>>>> something more general could be more useful).
>>>> If you think it would be helpful, I can create a Jira issue and work
>>>> on it to send a PR.
>>>>
>>>> Best Regards
>>>> Moein
>>>>
>>>> --
>>>> Moein Hosseini
>>>> Data Engineer
>>>> mobile: +98 912 468 1859
>>>> site: www.moein.xyz
>>>> email: [email protected]
>>>> linkedin: https://www.linkedin.com/in/moeinhm
>>>> twitter: https://twitter.com/moein7tl

--
Regards,
Maciej
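PS: For completeness, a minimal sketch of the cache-and-filter pattern Sean
describes (splitBy is an illustrative helper here, not an existing Spark
API):

    import org.apache.spark.sql.{Column, DataFrame}

    // Materialize the parent once, then derive one child per predicate.
    // Each child still runs its own filter, but over the cached parent.
    def splitBy(df: DataFrame, predicates: Seq[Column]): Seq[DataFrame] = {
      df.cache()
      predicates.map(p => df.filter(p))
    }

randomSplit() covers the special case of random, disjoint subsets:

    val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)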
