Yes, you can recover the correct support this way by accounting for how
many rows were filtered out, but not the correct confidence, since that
depends on counting support in rows that don't contain the item of interest.
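
Concretely, here is a rough, untested PySpark sketch of what I mean; the
item name "target" and the toy rows are just placeholders:

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# Toy transactions; "target" stands in for the consequent of interest.
df = spark.createDataFrame(
    [(["a", "b", "target"],), (["a", "c"],), (["b", "target"],), (["a", "b"],)],
    ["items"],
)

total = df.count()
filtered = df.where(F.array_contains("items", "target"))

model = FPGrowth(itemsCol="items", minSupport=0.5).fit(filtered)

# An itemset containing "target" can only occur in rows containing
# "target", so its count on the filtered input equals its count on the
# full data, and freq / total is its true global support:
global_support = model.freqItemsets.withColumn(
    "globalSupport", F.col("freq") / F.lit(total)
)

# But confidence(A => target) = freq(A + target) / freq(A), and freq(A)
# here misses every row that has A without "target", so the confidence
# reported on the filtered input is inflated.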

But computing confidence requires computing all of that support, so how
would you optimize it even if you knew the consequent you cared about?
Maybe there's a way; I don't know the code well, but it wasn't obvious
at a glance how to take advantage of it.

I can see how limiting the rule size could help.
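
The post-hoc filtering that works today looks something like this (again
just a sketch; assume "model" is an FPGrowthModel fit on the full,
unfiltered input):

from pyspark.sql import functions as F

# Mine on the full input, then keep only rules whose consequent is the
# item of interest and whose antecedent has at most 5 items.
rules = model.associationRules
target_rules = rules.where(
    F.array_contains("consequent", "target") & (F.size("antecedent") <= 5)
)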

On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli <dyex...@gmail.com> wrote:
>
> Hi Sean,
>
> I understand your approach, but there's a slight problem.
>
> If we generate rules after filtering for our desired consequent, we
> introduce bias into the rules.
> The confidence of a rule on the filtered input can be very high even when
> it is not high on the entire dataset.
> Thus we can get biased rules that misrepresent the patterns in the data.
> This is why I think having a parameter to mention the consequent would help 
> greatly.
>
> Raising the minimum support doesn't really work in my case, simply because
> the consequents I am mining for occur very rarely in the data.
> Their support can be 1e-4 or 1e-5, so my minSupport has to be below that
> to capture the rules for those consequents.
>
> Thanks for your reply. Let me know what you think.
>
> Regards,
> Aditya Addepalli
>
>
>
>
> On Sat, 2 May, 2020, 9:13 pm Sean Owen, <sro...@gmail.com> wrote:
>>
>> You could just filter the input down to the sets containing the desired
>> item, and discard the rest. That doesn't mean all of the resulting item
>> sets will contain that item, so you'd still have to filter the output,
>> but it may be much faster to compute.
>> Increasing min support might generally have the effect of producing
>> smaller rules, though it doesn't impose a cap. That could help
>> performance, if that's what you're trying to improve.
>> I don't know whether it's worth new params in the implementation; maybe.
>> I think there would have to be an argument that this generalizes.
>>
>> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli <dyex...@gmail.com> wrote:
>> >
>> > Hi Everyone,
>> >
>> > I was wondering if we could make any enhancements to the FP-Growth 
>> > algorithm in spark/pyspark.
>> >
>> > Many times I am looking for rules with a particular consequent, so I don't
>> > need the rules for all the other consequents. I know I can filter the
>> > rules to get the desired output, but if I could pass this to the
>> > algorithm itself, the execution time would drop drastically.
>> >
>> > Also, sometimes I want the rules to be small, maybe of length 5-6. Again,
>> > I can filter on length, but I was wondering if we could take this as input
>> > to the algorithm. Given the depth-first nature of FP-Growth, I am not sure
>> > that is feasible.
>> >
>> > I am willing to work on these suggestions if someone thinks they are
>> > feasible. Thanks to the dev team for all the hard work!
>> >
>> > Regards,
>> > Aditya Addepalli
