The persist used in implicit ALS doesn't help with the StackOverflowError
problem. Persist doesn't cut the lineage; we need to call count() and then
checkpoint() to cut it. Did you try the workaround mentioned
in https://issues.apache.org/jira/browse/SPARK-958:

"I tune JVM thread stack size to 512k via option -Xss512k and it works."
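
For reference, a minimal sketch of the count() + checkpoint() pattern applied
to the explicit-ALS loop quoted below in this thread. This is illustrative
only: the checkpoint directory and the checkpoint interval are assumptions,
not something ALS exposes today.

```scala
// Sketch only: periodically checkpoint the factor RDDs to truncate lineage.
// The directory and interval below are illustrative assumptions.
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

val checkpointInterval = 5 // assumption: checkpoint every 5 iterations

for (iter <- 1 to iterations) {
  products = updateFeatures(users, userOutLinks, productInLinks,
    partitioner, rank, lambda, alpha, YtY = None)
  users = updateFeatures(products, productOutLinks, userInLinks,
    partitioner, rank, lambda, alpha, YtY = None)

  if (iter % checkpointInterval == 0) {
    users.checkpoint() // mark the RDD for checkpointing
    users.count()      // force materialization so the checkpoint is written
  }
}
```

Note that checkpoint() alone is lazy; the count() action is what actually
materializes the RDD and writes the checkpoint, which is why persist alone
doesn't break the lineage chain.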

Best,
Xiangrui

On Sun, Apr 6, 2014 at 10:21 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> At the head I see a persist option in implicitPrefs. Given more cases like
> the ones mentioned above, why don't we use a similar technique for explicit
> runs as well, and take an input specifying at which iterations to persist?
>
> for (iter <- 1 to iterations) {
>   // perform ALS update
>   logInfo("Re-computing I given U (Iteration %d/%d)".format(iter, iterations))
>   products = updateFeatures(users, userOutLinks, productInLinks,
>     partitioner, rank, lambda, alpha, YtY = None)
>   logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, iterations))
>   users = updateFeatures(products, productOutLinks, userInLinks,
>     partitioner, rank, lambda, alpha, YtY = None)
> }
>
> Say if I want to persist at every k iterations out of N iterations of
> explicit ALS, there should be an option to do that...implicit right now
> persists at each iteration...
>
> Does this option make sense, or would you rather see this issue fixed in a
> different way?
>
> I definitely see that for my 25M x 3M run, with 64 GB executor memory,
> something goes wrong after the 5th iteration, and I wanted to run for 10
> iterations...
>
> So my k is 4 or 5 for this particular problem...
>
> I can ask for the PR after testing the fix on the dataset I have...I will
> also try to see if we can make such datasets public for more research...
>
> For the LDA problem mentioned earlier in this email chain, k is 10...NMF
> can generate topics similar to LDA as well...the Carrot2 project uses it...
>
>
>
> On Thu, Mar 27, 2014 at 3:20 PM, Debasish Das <debasish.da...@gmail.com>wrote:
>
>> Hi Matei,
>>
>> I am hitting similar problems with 10 ALS iterations...I am running with
>> 24 GB executor memory on 10 nodes for a 20M x 3M matrix with rank = 50.
>>
>> The first iteration of flatMaps runs fine, which means that the memory
>> requirements per iteration are fine...
>>
>> If I do checkpointing on the RDD, most likely the remaining 9 iterations
>> will also run fine and I will get the results...
>>
>> Is there a plan to add a checkpoint option to ALS for such large
>> factorization jobs?
>>
>> Thanks.
>> Deb
>>
>>
>>
>>
>>
>> On Tue, Jan 28, 2014 at 11:10 PM, Matei Zaharia 
>> <matei.zaha...@gmail.com>wrote:
>>
>>> That would be great to add. Right now it would be easy to change it to
>>> use another Hadoop FileSystem implementation at the very least (I think you
>>> can just pass the URL for that), but for Cassandra you'd have to use a
>>> different InputFormat or some direct Cassandra access API.
>>>
>>> Matei
>>>
>>> On Jan 28, 2014, at 5:02 PM, Evan Chan <e...@ooyala.com> wrote:
>>>
>>> > By the way, is there any plan to make a pluggable backend for
>>> > checkpointing? We might be interested in writing, for example, a
>>> > Cassandra backend.
>>> >
>>> > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com>
>>> wrote:
>>> >> Hi all
>>> >>
>>> >> The description of this bug submitted by Matei is as follows:
>>> >>
>>> >>
>>> >> The tipping point seems to be around 50. We should fix this by
>>> checkpointing the RDDs every 10-20 iterations to break the lineage chain,
>>> but checkpointing currently requires HDFS installed, which not all users
>>> will have.
>>> >>
>>> >> We might also be able to fix DAGScheduler to not be recursive.
>>> >>
>>> >>
>>> >> regards,
>>> >> Andrew
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Evan Chan
>>> > Staff Engineer
>>> > e...@ooyala.com
>>>
>>>
>>
