If the goal is to split the output, then `DataFrameWriter.partitionBy` should do what you need, and no additional methods are required. If not, you can also check Silex's muxPartitions implementation (see https://stackoverflow.com/a/37956034), but its applications are rather limited due to high resource usage.
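For example, a minimal sketch of the partitionBy approach (the column name
and output path are illustrative):

    // Writes one directory per distinct value of "category", e.g.
    // /tmp/out/category=a/, /tmp/out/category=b/, ...
    df.write
      .partitionBy("category")
      .parquet("/tmp/out")

Note this splits the *output* on disk; each subset can then be read back
independently, e.g. spark.read.parquet("/tmp/out/category=a").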
On Sun, 3 Feb 2019 at 15:41, Sean Owen <[email protected]> wrote:

> I don't think Spark supports this model, where N inputs depending on a
> parent are computed once at the same time. You can of course cache the
> parent and filter N times and do the same amount of work. One problem is,
> where would the N inputs live? They'd have to be stored if not used
> immediately, and presumably in any use case, only one of them would be
> used immediately. If you have a job that needs to split records of a
> parent into N subsets, and then all N subsets are used, you can do that --
> you are just transforming the parent to one child that has rows with those
> N splits of each input row, and then consuming that. See randomSplit() for
> maybe the best case, where it still produces N Datasets but can do so
> efficiently because it's just a random sample.
>
> On Sun, Feb 3, 2019 at 12:20 AM Moein Hosseini <[email protected]> wrote:
>
>> I don't consider it a method to apply filtering multiple times; instead,
>> I'd use it as a semi-action, not just a transformation. Suppose we had
>> something like map-partition that accepts multiple lambdas, each of which
>> collects rows for its own dataset (or something like that). Is that
>> possible?
>>
>> On Sat, Feb 2, 2019 at 5:59 PM Sean Owen <[email protected]> wrote:
>>
>>> I think the problem is that you can't produce multiple Datasets from one
>>> source in one operation -- consider that reproducing one of them would
>>> mean reproducing all of them. You can write a method that would do the
>>> filtering multiple times, but it wouldn't be faster. What do you have in
>>> mind that's different?
>>>
>>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <[email protected]> wrote:
>>>
>>>> I've seen many applications that need to split a dataset into multiple
>>>> datasets based on some conditions. As there is no method to do it in
>>>> one place, developers use the *filter* method multiple times. I think
>>>> it would be useful to have a method that splits a dataset based on
>>>> conditions in one iteration, something like Scala's *partition* method
>>>> (of course, Scala's partition just splits a list into two lists, but
>>>> something more general could be more useful).
>>>> If you think it would be helpful, I can create a Jira issue and work
>>>> on it to send a PR.
>>>>
>>>> Best Regards
>>>> Moein
>>>>
>>>> --
>>>> Moein Hosseini
>>>> Data Engineer
>>>> mobile: +98 912 468 1859
>>>> site: www.moein.xyz
>>>> email: [email protected]
>>>> linkedin: https://www.linkedin.com/in/moeinhm
>>>> twitter: https://twitter.com/moein7tl

--
Regards,
Maciej
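PS: For completeness, a minimal sketch of the cache-and-filter pattern Sean
describes (splitBy is an illustrative helper here, not an existing Spark
API):

    import org.apache.spark.sql.{Column, DataFrame}

    // Materialize the parent once, then derive one child per predicate.
    // Each child still runs its own filter, but over the cached parent.
    def splitBy(df: DataFrame, predicates: Seq[Column]): Seq[DataFrame] = {
      df.cache()
      predicates.map(p => df.filter(p))
    }

randomSplit() covers the special case of random, disjoint subsets:

    val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)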
