It might improve the quality of the embedding at the expense of speed. For
instance, you might want to try a density of 1/3, which is the value used in
the Achlioptas paper [1].
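
Something like this (a minimal sketch; X is assumed to be your TF-IDF matrix
from the snippet below, and random_state is just there for reproducibility):

from sklearn.random_projection import SparseRandomProjection

# Achlioptas-style projection: density = 1/3 instead of the default
# 1 / sqrt(n_features) recommended by Li et al.
proj = SparseRandomProjection(density=1 / 3., random_state=0)
X2 = proj.fit_transform(X)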

For the number of components, you might want to play with
johnson_lindenstrauss_min_dim [2] to get a better idea of the quality of your
embedding. The number of components will depend greatly on your number of
samples and on the epsilon-quality of the embedding.
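
For example (a rough sketch; eps=0.1 is just an example value and X is again
assumed to be your TF-IDF matrix):

from sklearn.random_projection import (SparseRandomProjection,
                                       johnson_lindenstrauss_min_dim)

# minimal number of components needed to preserve pairwise distances
# within a factor (1 +/- eps), given the number of samples
n_components = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=0.1)
proj = SparseRandomProjection(n_components=n_components)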

Best regards,
Arnaud

[1] Dimitris Achlioptas. Database-friendly random projections:
Johnson-Lindenstrauss with binary coins. Microsoft Research.
[2] 
http://scikit-learn.org/dev/modules/generated/sklearn.random_projection.johnson_lindenstrauss_min_dim.html#sklearn.random_projection.johnson_lindenstrauss_min_dim

On 08 Aug 2014, at 13:22, Philipp Singer <kill...@gmail.com> wrote:

> I always normalize X prior to the random projection as I have observed that 
> this always produces more accurate results (same for LSA/SVD).
> 
> Have not tried to increase eps, as this would lead to far fewer features and
> more error. I am also not sure how I should alter the density parameter. I
> feel safer using the auto value, which calculates it according to the Li et
> al. paper. Could you recommend some value?
> 
> I think I will be more effective with LSA for now. Are there any specific
> recommendations for the number of components? I chose 300 for now.
> 
> Best,
> Philipp
> 
> Am 08.08.2014 um 13:14 schrieb Arnaud Joly <a.j...@ulg.ac.be>:
> 
>> Have you tried to increase the number of components, the epsilon parameter,
>> or the density of the SparseRandomProjection?
>> Have you tried to normalise X prior to the random projection?
>> 
>> Best regards,
>> Arnaud
>> 
>> On 08 Aug 2014, at 12:19, Philipp Singer <kill...@gmail.com> wrote:
>> 
>>> Just another remark regarding this:
>>> 
>>> I guess I cannot circumvent the negative cosine similarity values. Maybe
>>> LSA is a better approach? (TruncatedSVD)
>>> 
>>> Am 08.08.2014 um 10:35 schrieb Philipp Singer <kill...@gmail.com>:
>>> 
>>>> Hi,
>>>> 
>>>> I asked a question about the sparse random projection a few days ago, but
>>>> thought I should start a new topic regarding my current problem.
>>>> 
>>>> I am calculating TF-IDF weights for my text documents and then computing
>>>> the cosine similarity between documents to determine their similarity. For
>>>> dimensionality reduction I am using the SparseRandomProjection class.
>>>> 
>>>> My current process looks like the following:
>>>> 
>>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>>> from sklearn.preprocessing import normalize
>>>> from sklearn.random_projection import SparseRandomProjection
>>>> 
>>>> docs = [text1, text2,…]
>>>> vec = TfidfVectorizer(max_df=0.8)
>>>> X = vec.fit_transform(docs)
>>>> proj = SparseRandomProjection()
>>>> X2 = proj.fit_transform(X)
>>>> X2 = normalize(X2)  # for L2 normalization
>>>> sim = X2 * X2.T
>>>> 
>>>> It works reasonably well. However, I found out that the sparse random
>>>> projection sets many weights to a negative value, so many similarity
>>>> scores also end up being negative. Given the original intention of TF-IDF
>>>> weights (which should never be negative) and the corresponding cosine
>>>> similarity scores (which should then only range between zero and one), I
>>>> do not know whether this is an appropriate approach for my task.
>>>> 
>>>> Hope someone has some advice. Maybe I am also doing something wrong here.
>>>> 
>>>> Best,
>>>> Philipp
>>>> 

