Hi Dmitriy,

Here's the code. It does cooccurrence analysis with log-likelihood ratio
tests. I haven't run it on a cluster yet:

https://gist.github.com/sscdotopen/8314254
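
For anyone who doesn't want to open the gist: the core idea of such an
analysis is to score each pair of cooccurring items with a log-likelihood
ratio (LLR) test on a 2x2 contingency table. The snippet below is only a
minimal single-machine sketch in plain Python, not the Spark code from the
gist. The function names (`log_likelihood_ratio`, `cooccurrence_llr`) and
the input format (one set of items per user) are assumptions made for
illustration.

```python
import math
from collections import Counter
from itertools import combinations


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 table: k11 = both items, k12 = only A,
    k21 = only B, k22 = neither."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    if row_entropy + col_entropy < matrix_entropy:
        # Guard against tiny negative values caused by round-off.
        return 0.0
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)


def cooccurrence_llr(baskets):
    """baskets: iterable of item collections, one per user (assumed format).
    Returns {(item_a, item_b): llr_score} for every cooccurring pair."""
    baskets = [sorted(set(b)) for b in baskets]
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for items in baskets:
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = item_counts[a] - k11   # users with a but not b
        k21 = item_counts[b] - k11   # users with b but not a
        k22 = n - k11 - k12 - k21    # users with neither
        scores[(a, b)] = log_likelihood_ratio(k11, k12, k21, k22)
    return scores
```

A higher score means the pair cooccurs more (or less) often than
independence would predict, so in practice one would keep only the
top-scoring pairs per item.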

--sebastian

On 07.01.2014 23:53, Dmitriy Lyubimov wrote:
> @Sebastian,
> wanna post a link?
> 
> 
> On Tue, Jan 7, 2014 at 2:46 PM, Sebastian Schelter <[email protected]> wrote:
> 
>> I also have some spark cooccurrence analysis code lying around that
>> might be a nice contribution.
>>
>> On 07.01.2014 23:44, Dmitriy Lyubimov wrote:
>>> If you want to contribute to Mahout, obviously you want to speak to the
>>> Mahout dev audience. Spark is not yet officially integrated into Mahout,
>>> but we are actively contemplating it, and I have been doing some work
>>> off SVN, e.g.
>>> https://issues.apache.org/jira/browse/MAHOUT-1346,
>>> https://issues.apache.org/jira/browse/MAHOUT-1365 and some other
>>> algorithm ports.
>>>
>>>
>>> On Tue, Jan 7, 2014 at 1:30 PM, Oleksandr Olgashko <[email protected]> wrote:
>>>
>>>> I haven't worked with Spark before (I have only read their overview page).
>>>> Should I ask questions here as they arise, or is it better to switch to
>>>> Spark's mailing lists?
>>>>
>>>>
>>>> 2014/1/7 Sebastian Schelter <[email protected]>
>>>>
>>>>> IIRC that paper talks about MapReduce on a shared-memory system, not
>>>>> on a shared-nothing system such as the Hadoop implementation.
>>>>>
>>>>> As a rule of thumb, iterations in Hadoop are about 10x slower than in
>>>>> systems such as Giraph, Spark or Stratosphere.
>>>>>
>>>>> --sebastian
>>>>>
>>>>> On 07.01.2014 22:01, Oleksandr Olgashko wrote:
>>>>>> What can you say about
>>>>>> http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
>>>>>> ?
>>>>>>
>>>>>>
>>>>>> 2014/1/7 Dmitriy Lyubimov <[email protected]>
>>>>>>
>>>>>>> Yes. Create working notes on how exactly to do that. (Or, what I am
>>>>>>> pushing you towards a bit: Spark, since MR is not really an
>>>>>>> iteration-friendly platform and it looks like iterations are needed
>>>>>>> in fastICA.)
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 7, 2014 at 12:38 PM, Oleksandr Olgashko <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> So the problem is to adapt ICA for MR, am i right?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014/1/7 Dmitriy Lyubimov <[email protected]>
>>>>>>>>
>>>>>>>>> I already looked at fast ICA. While it claims to be parallel, this
>>>>>>>>> work doesn't exactly map it onto the MapReduce (or Spark) paradigm,
>>>>>>>>> and from what I can recollect it still implies outer iterations for
>>>>>>>>> fitting principal component vectors one by one. Which means it
>>>>>>>>> probably already is MR-unfriendly by construction; Spark may show
>>>>>>>>> far better promise here, but a working notes document is still
>>>>>>>>> required to show how exactly. That's what I mean.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 7, 2014 at 1:35 AM, Oleksandr Olgashko <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Could you please take a look at this article?
>>>>>>>>>> http://cran.r-project.org/web/packages/fastICA/fastICA.pdf
>>>>>>>>>> I have learned that re-inventing the wheel is wrong for most
>>>>>>>>>> problems, and that a better solution usually exists. However, it
>>>>>>>>>> often needs some "grinding", so I may research those ways, if you
>>>>>>>>>> approve.
>>>>>>>>>>
>>>>>>>>>> About Scala: unfortunately, I have never worked with this language
>>>>>>>>>> before, but wanted to. I'd like to fill that gap in my skills, but
>>>>>>>>>> I don't know exactly where to start.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014/1/7 Dmitriy Lyubimov <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> ICA is a very useful technique for dimensionality reduction. I
>>>>>>>>>>> believe Mahout would benefit from it; however, the challenges are
>>>>>>>>>>> fairly significant in terms of a proven parallelization technique
>>>>>>>>>>> and acceptable efficacy, which makes it hard to just "implement"
>>>>>>>>>>> (I am not familiar at this point with any concrete work on
>>>>>>>>>>> parallel ICA). So, like I said before, I am not very hopeful.
>>>>>>>>>>> However, if one never tries, then nothing will ever get done. Who
>>>>>>>>>>> knows.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 6, 2014 at 2:18 PM, Isabel Drost-Fromm <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 06, 2014 at 10:40:45PM +0200, Oleksandr Olgashko wrote:
>>>>>>>>>>>>> Returning to the question I asked 2 months ago about a topic to
>>>>>>>>>>>>> work on: what algorithm should I implement?
>>>>>>>>>>>>
>>>>>>>>>>>> To be quite frank with you: None. Personally I'd rather see
>>>>>>>>>>>> improvements (in terms of documentation, integration,
>>>>>>>>>>>> stabilisation, performance optimisation) of the existing Mahout
>>>>>>>>>>>> source.
>>>>>>>>>>>>
>>>>>>>>>>>> Feel free to take a closer look at the thread concerning "getting
>>>>>>>>>>>> involved" that we had around Christmas last year for inspiration.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Isabel
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
> 
