Re: [scikit-learn] adding BM25 relevance function

Joel Nothman Wed, 15 Jun 2016 19:28:23 -0700

If xrange is the issue, then the branch you're getting may not have been
tested for Python 3.
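
For what it's worth, the import at line 28 of text.py looks like a plain
compatibility alias: as far as I know, six.moves.range resolves to xrange on
Python 2 and to the built-in range on Python 3, so the name xrange only has
to exist on Python 2. A static analyzer flagging it under Python 3 may be
harmless. A minimal stdlib-only sketch of the same idea (lazy_range is a
made-up name for illustration, not something in scikit-learn or six):

```python
# Stdlib-only sketch of what a range/xrange compatibility alias does.
# lazy_range is a hypothetical name used for illustration only.
try:
    lazy_range = xrange  # Python 2: xrange is the lazy variant
except NameError:
    lazy_range = range   # Python 3: xrange is gone, range is already lazy

squares = [i * i for i in lazy_range(5)]
print(squares)
```

If this runs on your interpreter, the alias itself is unlikely to be the
cause of the build failure.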


On 16 June 2016 at 03:53, Andreas Mueller <[email protected]> wrote:

> I don't see an unresolved reference to xrange, but I do see that it can't
> import sklearn.
> Did you build scikit-learn?
> See:
>
> http://scikit-learn.org/dev/developers/contributing.html#retrieving-the-latest-code
>
> Either do
>
> make
> or
> python setup.py build_ext -i
> or
> python setup.py develop
> or
> pip install -e .
>
> (which all do slightly different things)
>
> I'd probably go with the first if you have another installation of
> scikit-learn on your machine
> and the last if you want to make that your primary installation.
>
> Cheers,
> Andy
>
>
> On 06/15/2016 01:01 AM, Basil Beirouti wrote:
>
> Hello Pavel and Joel,
>
> I forked the repository and cloned it on my machine. I'm using pycharm on
> a Mac, and while looking at text.py, I'm getting an unresolved reference
> for "xrange" at line 28:
>
> from ..externals.six.moves import range
>
> PyCharm says Function 'six.py' is too large to analyze, so I'm not sure if
> this error is somehow related to that. I decided to try to build the code as
> a sanity check, but I can't find any reliable instructions on how to do
> that. Naively, I opened a terminal, changed to the directory above the
> "scikit-learn" folder (where I had cloned my fork) and tried to run:
>
> $ python3 setup.py install
>
> Which didn't work. I got this error:
>
> ImportError: No module named 'sklearn'
>
> Can someone point me in the right direction? And how can the code try to 
> import sklearn if it doesn't exist yet? Note I haven't installed the release 
> version of scikit-learn using pip or any other tool, but I should be able to 
> bootstrap it from the source code, right?
>
> Here's the full error message if it helps. Forgive me if it's a silly 
> mistake, but I haven't found any reliable guidelines online.
>
>   File "setup.py", line 84, in <module>
>
>     from numpy.distutils.core import setup
>
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/core.py",
>  line 26, in <module>
>
>     from numpy.distutils.command import config, config_compiler, \
>
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/command/build_ext.py",
>  line 18, in <module>
>
>     from numpy.distutils.system_info import combine_paths
>
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/system_info.py",
>  line 232, in <module>
>
>     triplet = str(p.communicate()[0].decode().strip())
>
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py",
>  line 791, in communicate
>
>     stdout = _eintr_retry_call(self.stdout.read)
>
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py",
>  line 476, in _eintr_retry_call
>
>     return func(*args)
>
> KeyboardInterrupt
>
> Basils-MacBook-Pro:sklearn basilbeirouti$ python3 setup.py install
>
> non-existing path in '__check_build': '_check_build.c'
>
> Appending sklearn.__check_build configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.__check_build')
>
> Appending sklearn._build_utils configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn._build_utils')
>
> Appending sklearn.covariance configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance')
>
> Appending sklearn.covariance/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance/tests')
>
> Appending sklearn.cross_decomposition configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 
> 'sklearn.cross_decomposition')
>
> Appending sklearn.cross_decomposition/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 
> 'sklearn.cross_decomposition/tests')
>
> Appending sklearn.feature_selection configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection')
>
> Appending sklearn.feature_selection/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 
> 'sklearn.feature_selection/tests')
>
> Appending sklearn.gaussian_process configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process')
>
> Appending sklearn.gaussian_process/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 
> 'sklearn.gaussian_process/tests')
>
> Appending sklearn.mixture configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture')
>
> Appending sklearn.mixture/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture/tests')
>
> Appending sklearn.model_selection configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection')
>
> Appending sklearn.model_selection/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 
> 'sklearn.model_selection/tests')
>
> Appending sklearn.neural_network configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network')
>
> Appending sklearn.neural_network/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 
> 'sklearn.neural_network/tests')
>
> Appending sklearn.preprocessing configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing')
>
> Appending sklearn.preprocessing/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 
> 'sklearn.preprocessing/tests')
>
> Appending sklearn.semi_supervised configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised')
>
> Appending sklearn.semi_supervised/tests configuration to sklearn
>
> Ignoring attempt to set 'name' (from 'sklearn' to 
> 'sklearn.semi_supervised/tests')
>
> Warning: Assuming default configuration (./_build_utils/{setup__build_utils,setup}.py was not found)
> Warning: Assuming default configuration (./covariance/{setup_covariance,setup}.py was not found)
> Warning: Assuming default configuration (./covariance/tests/setup_covariance/{setup_covariance/tests,setup}.py was not found)
> Warning: Assuming default configuration (./cross_decomposition/{setup_cross_decomposition,setup}.py was not found)
> Warning: Assuming default configuration (./cross_decomposition/tests/setup_cross_decomposition/{setup_cross_decomposition/tests,setup}.py was not found)
> Warning: Assuming default configuration (./feature_selection/{setup_feature_selection,setup}.py was not found)
> Warning: Assuming default configuration (./feature_selection/tests/setup_feature_selection/{setup_feature_selection/tests,setup}.py was not found)
> Warning: Assuming default configuration (./gaussian_process/{setup_gaussian_process,setup}.py was not found)
> Warning: Assuming default configuration (./gaussian_process/tests/setup_gaussian_process/{setup_gaussian_process/tests,setup}.py was not found)
> Warning: Assuming default configuration (./mixture/{setup_mixture,setup}.py was not found)
> Warning: Assuming default configuration (./mixture/tests/setup_mixture/{setup_mixture/tests,setup}.py was not found)
> Warning: Assuming default configuration (./model_selection/{setup_model_selection,setup}.py was not found)
> Warning: Assuming default configuration (./model_selection/tests/setup_model_selection/{setup_model_selection/tests,setup}.py was not found)
> Warning: Assuming default configuration (./neural_network/{setup_neural_network,setup}.py was not found)
> Warning: Assuming default configuration (./neural_network/tests/setup_neural_network/{setup_neural_network/tests,setup}.py was not found)
> Warning: Assuming default configuration (./preprocessing/{setup_preprocessing,setup}.py was not found)
> Warning: Assuming default configuration (./preprocessing/tests/setup_preprocessing/{setup_preprocessing/tests,setup}.py was not found)
> Warning: Assuming default configuration (./semi_supervised/{setup_semi_supervised,setup}.py was not found)
> Warning: Assuming default configuration (./semi_supervised/tests/setup_semi_supervised/{setup_semi_supervised/tests,setup}.py was not found)
>
> Traceback (most recent call last):
>
>   File "setup.py", line 85, in <module>
>
>     setup(**configuration(top_path='').todict())
>
>   File "setup.py", line 44, in configuration
>
>     config.add_subpackage('cluster')
>
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py",
>  line 1003, in add_subpackage
>
>     caller_level = 2)
>
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py",
>  line 972, in get_subpackage
>
>     caller_level = caller_level + 1)
>
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py",
>  line 884, in _get_configuration_from_setup_py
>
>     ('.py', 'U', 1))
>
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", 
> line 234, in load_module
>
>     return load_source(name, filename, file)
>
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", 
> line 172, in load_source
>
>     module = _load(spec)
>
>   File "<frozen importlib._bootstrap>", line 693, in _load
>
>   File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
>
>   File "<frozen importlib._bootstrap_external>", line 662, in exec_module
>
>   File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
>
>   File "./cluster/setup.py", line 8, in <module>
>
>     from sklearn._build_utils import get_blas_info
>
> ImportError: No module named 'sklearn'
>
> On Tue, Jun 14, 2016 at 11:41 AM, <[email protected]> wrote:
>
>> Send scikit-learn mailing list submissions to
>>         [email protected]
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>         https://mail.python.org/mailman/listinfo/scikit-learn
>> or, via email, send a message with subject or body 'help' to
>>         [email protected]
>>
>> You can reach the person managing the list at
>>         [email protected]
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of scikit-learn digest..."
>>
>>
>> Today's Topics:
>>
>>    1. Re: Adding BM25 relevance function (Pavel Soriano)
>>    2. Re: The culture of commit squashing (Andreas Mueller)
>>    3. Re: The culture of commit squashing (Tom DLT)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Tue, 14 Jun 2016 16:11:10 +0000
>> From: Pavel Soriano <[email protected]>
>> To: Scikit-learn user and developer mailing list
>>         <[email protected]>
>> Subject: Re: [scikit-learn] Adding BM25 relevance function
>> Message-ID:
>>         <
>> can0wwk93r2aw9no65cgicw5hqg7-ofyvzamjqpxpegtxmsq...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hey,
>>
>> Good thing that you are trying to finish this.
>>
>> Well, I looked into my old notes, and the Delta tf-idf comes from the
>> "Delta TFIDF: An Improved Feature Space for Sentiment Analysis"
>> <http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf> paper. I guess
>> it is not very popular and apparently it has a drawback: it does not take
>> into account the number of times a word occurs in each document while
>> calculating the distribution amongst classes. At least that is what I
>> wrote in my notes...
>>
>> As for the delta idf... If it helps, I can look into my old code, because
>> I do not know what I was talking about. I guess it has to do somehow with
>> the paper cited before.
>>
>> Cheers,
>>
>> Pavel Soriano
>>
>>
>>
>>
>> On Tue, Jun 14, 2016 at 5:49 PM Basil Beirouti <[email protected]>
>> wrote:
>>
>> > Hi Joel,
>> >
>> > Thanks for your response and for digging up that archived thread, it
>> > gives me a lot of clarity.
>> >
>> > I see your point about BM25, but I think in most cases where TFIDF makes
>> > sense, BM25 makes sense as well, but it could be "overkill".
>> >
>> > Consider that TFIDF does not produce normalized results either
>> > <http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py>.
>> > If BM25 requires dimensionality reduction (e.g. using LSA), so too would
>> > TFIDF. The term-document matrix is the same size no matter which
>> > weighting scheme is used. The only difference is that BM25 produces
>> > better results when the corpus is large enough that the term frequency
>> > in a document, and the document frequency in the corpus, can vary
>> > considerably across a broad range of values. Maybe you could even say
>> > TFIDF and BM25 are the same equation except BM25 has a few additional
>> > hyperparameters (b and k).
>> >
>> > So is the advantage that BM25 provides for large, diverse corpora worth
>> > it, or is it marginal? Perhaps you can point me to some more examples
>> > where TFIDF is used (in a supervised setting, preferably) and I can plug
>> > in BM25 in place of TFIDF and see how it compares. Here are some I found:
>> >
>> >
>> > http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
>> > *(supervised)*
>> >
>> > http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py
>> > *(unsupervised)*
>> >
>> > Thank you!
>> > Basil
>> >
>> > PS: By the way, I'm not familiar with the delta-idf transform that Pavel
>> > mentions in the archive you linked; I'll have to delve deeper into
>> > that. I agree with the response to Pavel that he should be putting it in
>> > a separate class, not adding on to the TFIDF. I think it would take me
>> > about 6-8 weeks to adapt my code to the fit/transform model and submit a
>> > pull request.
>> >
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > [email protected]
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> --
>> Pavel SORIANO
>>
>> PhD Student
>> ERIC Laboratory
>> Université de Lyon
>> -------------- next part --------------
>> An HTML attachment was scrubbed...
>> URL: <
>> http://mail.python.org/pipermail/scikit-learn/attachments/20160614/cbe49979/attachment-0001.html
>> >
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Tue, 14 Jun 2016 12:13:29 -0400
>> From: Andreas Mueller <[email protected]>
>> To: Scikit-learn user and developer mailing list
>>         <[email protected]>
>> Subject: Re: [scikit-learn] The culture of commit squashing
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>>
>> I'm +1 for using the button when appropriate.
>> I think it should be up to the merging person to make a call whether a
>> squash is a better
>> logical unit than all the commits.
>> I would set like a soft limit at ~5 commits or something. If your PR has
>> more than 5 separate
>> big logical units, it's probably too big.
>>
>> The button is enabled in the settings but I can't see it.
>> Am I being stupid?
>>
>> On 06/14/2016 06:58 AM, Joel Nothman wrote:
>> > Sounds good to me. Thank goodness someone reads the documentation!
>> >
>> > On 14 June 2016 at 19:51, Alexandre Gramfort
>> > <[email protected]
>> > <mailto:[email protected]>> wrote:
>> >
>> >     > We could stop squashing during development, and use the new
>> >     > Squash-and-Merge button on GitHub.
>> >     > What do you think?
>> >
>> >     +1
>> >
>> >     the reason I see for squashing during dev is to avoid killing the
>> >     browser when reviewing. It really rarely happens though.
>> >
>> >     A
>> >     _______________________________________________
>> >     scikit-learn mailing list
>> >     [email protected] <mailto:[email protected]>
>> >     https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > [email protected]
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> -------------- next part --------------
>> An HTML attachment was scrubbed...
>> URL: <
>> http://mail.python.org/pipermail/scikit-learn/attachments/20160614/135d4c27/attachment-0001.html
>> >
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Tue, 14 Jun 2016 18:40:39 +0200
>> From: Tom DLT <[email protected]>
>> To: Scikit-learn user and developer mailing list
>>         <[email protected]>
>> Subject: Re: [scikit-learn] The culture of commit squashing
>> Message-ID:
>>         <CAGKmC=sRMbwo1Pjm=
>> <ph3r6oqsmvzuzdbmjvj09yjwkk0%[email protected]>
>> [email protected]>
>> Content-Type: text/plain; charset="utf-8"
>>
>> @Andreas
>> It's a bit hidden: You need to click on "Merge pull-request", then do
>> *not* click on "Confirm merge", but on the small arrow to the right, and
>> select "Squash and merge".
>>
>> 2016-06-14 18:13 GMT+02:00 Andreas Mueller <[email protected]>:
>>
>> > I'm +1 for using the button when appropriate.
>> > I think it should be up to the merging person to make a call whether a
>> > squash is a better
>> > logical unit than all the commits.
>> > I would set like a soft limit at ~5 commits or something. If your PR has
>> > more than 5 separate
>> > big logical units, it's probably too big.
>> >
>> > The button is enabled in the settings but I can't see it.
>> > Am I being stupid?
>> >
>> >
>> > On 06/14/2016 06:58 AM, Joel Nothman wrote:
>> >
>> > Sounds good to me. Thank goodness someone reads the documentation!
>> >
>> > On 14 June 2016 at 19:51, Alexandre Gramfort <
>> > [email protected]> wrote:
>> >
>> >> > We could stop squashing during development, and use the new
>> >> > Squash-and-Merge button on GitHub.
>> >> > What do you think?
>> >>
>> >> +1
>> >>
>> >> the reason I see for squashing during dev is to avoid killing the
>> >> browser when reviewing. It really rarely happens though.
>> >>
>> >> A
>> >> _______________________________________________
>> >> scikit-learn mailing list
>> >> [email protected]
>> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >>
>> >
>> >
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > [email protected]
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> -------------- next part --------------
>> An HTML attachment was scrubbed...
>> URL: <
>> http://mail.python.org/pipermail/scikit-learn/attachments/20160614/511d2a1d/attachment.html
>> >
>>
>> ------------------------------
>>
>> Subject: Digest Footer
>>
>> _______________________________________________
>> scikit-learn mailing list
>> [email protected]
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>> ------------------------------
>>
>> End of scikit-learn Digest, Vol 3, Issue 27
>> *******************************************
>>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
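
One more observation on the quoted traceback: the shell prompt shows the
working directory as "sklearn", and the failing frame is ./cluster/setup.py,
which suggests setup.py was run from inside the package directory rather
than from the repository root. That is only my reading of the log, not
something confirmed in the thread. A minimal sketch of a check, assuming the
usual layout where the top-level setup.py sits next to the sklearn/ package
(looks_like_repo_root is a hypothetical helper, not part of scikit-learn):

```python
import os


def looks_like_repo_root(path):
    """Heuristic: the repository root holds setup.py next to sklearn/."""
    has_setup = os.path.isfile(os.path.join(path, "setup.py"))
    has_package = os.path.isfile(os.path.join(path, "sklearn", "__init__.py"))
    return has_setup and has_package


# Report whether the current directory looks like the place to build from.
print(looks_like_repo_root(os.getcwd()))
```

If this prints False from where you ran the build, change to the directory
that contains both setup.py and sklearn/ before trying again.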

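On Basil's remark in the quoted digest that TFIDF and BM25 are nearly the
same equation apart from the extra hyperparameters: a rough sketch of that
relationship, assuming the textbook Okapi BM25 term weight and a smoothed
IDF shared by both schemes. The function names are made up for illustration,
and k1 stands in for the "k" mentioned in the quoted message; none of this
is an existing scikit-learn API.

```python
import math


def idf(n_docs, doc_freq):
    # Smoothed inverse document frequency; using the same IDF for both
    # weightings is an assumption made to keep the comparison simple.
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0


def tfidf_weight(tf, n_docs, doc_freq):
    # Plain TF-IDF: grows linearly with the raw term frequency.
    return tf * idf(n_docs, doc_freq)


def bm25_weight(tf, n_docs, doc_freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Textbook Okapi BM25 term weight: k1 controls term-frequency
    # saturation, b controls document-length normalization.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf(n_docs, doc_freq) * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# With b = 0 and a very large k1, BM25 approaches plain TF-IDF.
near_tfidf = bm25_weight(3, 100, 10, 50, 50, k1=1e9, b=0.0)
print(near_tfidf, tfidf_weight(3, 100, 10))
```

With the default k1 the BM25 weight saturates as tf grows, which is the main
practical difference from TF-IDF in this sketch.
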
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
