[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952618#comment-15952618 ]

Cyril de Vogelaere edited comment on SPARK-20180 at 4/2/17 4:50 PM:
--------------------------------------------------------------------

Yes they can, it's really not a critical issue at all.
The current default pattern length also works well for most cases in practice,
except for very large datasets where the sequences are very long. But then I
suppose people would know about the parameter and set it to a large value.

However, allowing unlimited pattern length would cost nothing in terms of
performance; it's just an additional condition in an if. It may also be easier
than always setting the highest possible value. At least, that option wouldn't
hurt, and there was a TODO for it in the code. I think it would be good even if
we don't change the default value of 10. Changing the default makes more sense
to me, but I understand that, for backward compatibility, we can't just change
things as we want. So I will defer to the more senior contributors on this.
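
To make the idea concrete, here is a minimal sketch (illustrative names only,
not the actual MLlib internals) of what the extra condition amounts to: a
maxPatternLength of 0 would simply disable the length check.

{code:scala}
// Illustrative sketch only, not the actual PrefixSpan code in MLlib.
// The idea: treat maxPatternLength == 0 as "no limit", so supporting an
// unlimited pattern length is just one extra condition in the length check.
def mayGrowPattern(currentLength: Int, maxPatternLength: Int): Boolean = {
  maxPatternLength == 0 || currentLength < maxPatternLength
}
{code}

With a guard like that in place, the default could stay at 10 or move to 0
without touching anything else in the search.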


Actually, I have quite a few improvements in store for PrefixSpan, since I
worked on the algorithm for my master's thesis. Notably, a very performant
implementation that specializes PrefixSpan for single-item patterns, while
slightly improving the performance of multi-item patterns. But I was told I
needed to get familiar with contributing to Spark first ^^', which is why I'm
proposing and implementing this small, non-critical improvement.

I'm ready to push this small change anytime; it's already implemented. But the
contributor wiki asks to run dev/run-tests before pushing, and it's been
running for a day and a half already... Is that normal, by the way? Also, the
tests already found some errors, but I'm 99.999% sure they're not mine. They're
not even from the mllib module, which is the only thing I modified... Is that
normal too? I suppose so, but I wouldn't want to waste the reviewers' time ^^'


was (Author: syrux):
Yes they can, it's really not a critical issue at all.
The current default pattern length also works well for most cases in practice,
except for very large datasets where the sequences are very long. But then I
suppose people would know about the parameter and set it to a large value.

However, changing the default value to allow unlimited pattern length would
cost nothing in terms of performance; it's just an additional condition in an
if. It may also be easier than always setting the highest possible value. At
least, that option wouldn't hurt, and there was a TODO for it in the code.

Actually, I have quite a few improvements in store for PrefixSpan, since I
worked on the algorithm for my master's thesis. Notably, a very performant
implementation that specializes PrefixSpan for single-item patterns, while
slightly improving the performance of multi-item patterns. But I was told I
needed to get familiar with contributing to Spark first ^^', which is why I'm
proposing and implementing this small, non-critical improvement.

I'm ready to push this small change anytime; it's already implemented. But the
contributor wiki asks to run dev/run-tests before pushing, and it's been
running for a day and a half already... Is that normal, by the way? Also, the
tests already found some errors, but I'm 99.999% sure they're not mine. They're
not even from the mllib module, which is the only thing I modified... Is that
normal too? I suppose so, but I wouldn't want to waste the reviewers' time ^^'

> Unlimited max pattern length in Prefix span
> -------------------------------------------
>
>                 Key: SPARK-20180
>                 URL: https://issues.apache.org/jira/browse/SPARK-20180
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>            Reporter: Cyril de Vogelaere
>            Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Right now, we need to use the .setMaxPatternLength() method to specify the
> maximum pattern length of a sequence. Any pattern longer than that won't be
> output. The current default maxPatternLength value is 10.
> This should be changed so that with input 0, patterns of any length would be
> output. Additionally, the default value should be changed to 0, so that a new
> user could find all patterns in their dataset without looking at this
> parameter.
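
For context, this is roughly how the parameter is used today, adapted from the
MLlib PrefixSpan documentation example (toy data, and assuming a SparkContext
sc is available as in the spark-shell; under the proposal, passing 0 to
setMaxPatternLength would mean no length limit):

{code:scala}
import org.apache.spark.mllib.fpm.PrefixSpan

// Toy sequence database, as in the MLlib documentation example.
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)   // currently: patterns longer than 5 are dropped
val model = prefixSpan.run(sequences)

model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ", ", "]")).mkString("<", " ", ">") +
    ", " + fs.freq)
}
{code}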



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
