Re: [scikit-learn] adding BM25 relevance function

Andreas Mueller Wed, 15 Jun 2016 10:56:36 -0700

I don't see an unresolved reference to xrange, but I do see that itcan't import sklearn.

Did you built scikit-learn?
See:
http://scikit-learn.org/dev/developers/contributing.html#retrieving-the-latest-code\


Either do

make
or
python setup.py build_ext -i
or
python setup.py develop
or
pip install . -e

(which all do slightly different things)

I'd probably go with the first if you have another installation ofscikit-learn on your machine

and the last if you want to make that your primary installation.

Cheers,
Andy

On 06/15/2016 01:01 AM, Basil Beirouti wrote:

Hello Pavel and Joel,
I forked the repository and cloned it on my machine. I'm using pycharmon a Mac, and while looking at text.py, I'm getting an unresolvedreference for "xrange" at line 28:
from ..externals.six.movesimport range
Pycharm says Function 'six.py' is too large to analyze, so I'm notsure if this error is somehow related to that. I decided to try tobuild the code as a sanity check but I can't find any reliableinstructions as to how to do that. Naively, I opened terminal and cdto the directory above "scikit-learn" folder (where I had cloned myfork) and tried to run:
$ python3 setup.py install

Which didn't work. I got this error:

ImportError: No module named 'sklearn'
Can someone point me in the right direction? And how can the code tryto import sklearn if it doesn't exist yet? Note I haven't installedthe release version of scikit-learn using pip or any other tool, but Ishould be able to bootstrap it from the source code, right?
Here's the full error message if it helps. Forgive me if it's a sillymistake, but I haven't found any reliable guidelines online.
  File "setup.py", line 84, in <module>

    from numpy.distutils.core import setup
File"/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/core.py",line 26, in <module>
    from numpy.distutils.command import config, config_compiler, \
File"/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/command/build_ext.py",line 18, in <module>
    from numpy.distutils.system_info import combine_paths
File"/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/system_info.py",line 232, in <module>
    triplet = str(p.communicate()[0].decode().strip())
File"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py",line 791, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
File"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py",line 476, in _eintr_retry_call
    return func(*args)

KeyboardInterrupt

Basils-MacBook-Pro:sklearn basilbeirouti$ python3 setup.py install

non-existing path in '__check_build': '_check_build.c'

Appending sklearn.__check_build configuration to sklearn

Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.__check_build')

Appending sklearn._build_utils configuration to sklearn

Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn._build_utils')

Appending sklearn.covariance configuration to sklearn

Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance')

Appending sklearn.covariance/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.covariance/tests')
Appending sklearn.cross_decomposition configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.cross_decomposition')
Appending sklearn.cross_decomposition/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.cross_decomposition/tests')
Appending sklearn.feature_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.feature_selection')
Appending sklearn.feature_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.feature_selection/tests')
Appending sklearn.gaussian_process configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.gaussian_process')
Appending sklearn.gaussian_process/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.gaussian_process/tests')
Appending sklearn.mixture configuration to sklearn

Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture')

Appending sklearn.mixture/tests configuration to sklearn

Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture/tests')

Appending sklearn.model_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.model_selection')
Appending sklearn.model_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.model_selection/tests')
Appending sklearn.neural_network configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.neural_network')
Appending sklearn.neural_network/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.neural_network/tests')
Appending sklearn.preprocessing configuration to sklearn

Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing')

Appending sklearn.preprocessing/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.preprocessing/tests')
Appending sklearn.semi_supervised configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.semi_supervised')
Appending sklearn.semi_supervised/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to'sklearn.semi_supervised/tests')
Warning: Assuming default configuration(./_build_utils/{setup__build_utils,setup}.py was not found)Warning:Assuming default configuration(./covariance/{setup_covariance,setup}.py was not found)Warning:Assuming default configuration(./covariance/tests/setup_covariance/{setup_covariance/tests,setup}.pywas not found)Warning: Assuming default configuration(./cross_decomposition/{setup_cross_decomposition,setup}.py was notfound)Warning: Assuming default configuration(./cross_decomposition/tests/setup_cross_decomposition/{setup_cross_decomposition/tests,setup}.pywas not found)Warning: Assuming default configuration(./feature_selection/{setup_feature_selection,setup}.py was notfound)Warning: Assuming default configuration(./feature_selection/tests/setup_feature_selection/{setup_feature_selection/tests,setup}.pywas not found)Warning: Assuming default configuration(./gaussian_process/{setup_gaussian_process,setup}.py was notfound)Warning: Assuming default configuration(./gaussian_process/tests/setup_gaussian_process/{setup_gaussian_process/tests,setup}.pywas not found)Warning: Assuming default configuration(./mixture/{setup_mixture,setup}.py was not found)Warning: Assumingdefault configuration(./mixture/tests/setup_mixture/{setup_mixture/tests,setup}.py was notfound)Warning: Assuming default configuration(./model_selection/{setup_model_selection,setup}.py was notfound)Warning: Assuming default configuration(./model_selection/tests/setup_model_selection/{setup_model_selection/tests,setup}.pywas not found)Warning: Assuming default configuration(./neural_network/{setup_neural_network,setup}.py was notfound)Warning: Assuming default configuration(./neural_network/tests/setup_neural_network/{setup_neural_network/tests,setup}.pywas not found)Warning: Assuming default configuration(./preprocessing/{setup_preprocessing,setup}.py was not found)Warning:Assuming default configuration(./preprocessing/tests/setup_preprocessing/{setup_preprocessing/tests,setup}.pywas not found)Warning: Assuming default configuration(./semi_supervised/{setup_semi_supervised,setup}.py was notfound)Warning: Assuming default configuration(./semi_supervised/tests/setup_semi_supervised/{setup_semi_supervised/tests,setup}.pywas not found)Traceback (most recent call last):
  File "setup.py", line 85, in <module>

    setup(**configuration(top_path='').todict())

  File "setup.py", line 44, in configuration

    config.add_subpackage('cluster')
File"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py",line 1003, in add_subpackage
    caller_level = 2)
File"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py",line 972, in get_subpackage
    caller_level = caller_level + 1)
File"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py",line 884, in _get_configuration_from_setup_py
    ('.py', 'U', 1))
File"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py",line 234, in load_module
    return load_source(name, filename, file)
File"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py",line 172, in load_source
    module = _load(spec)

  File "<frozen importlib._bootstrap>", line 693, in _load

  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked

  File "<frozen importlib._bootstrap_external>", line 662, in exec_module
File "<frozen importlib._bootstrap>", line 222, in_call_with_frames_removed
  File "./cluster/setup.py", line 8, in <module>

    from sklearn._build_utils import get_blas_info

ImportError: No module named 'sklearn'
On Tue, Jun 14, 2016 at 11:41 AM, <[email protected]<mailto:[email protected]>> wrote:
    Send scikit-learn mailing list submissions to
    [email protected] <mailto:[email protected]>

    To subscribe or unsubscribe via the World Wide Web, visit
    https://mail.python.org/mailman/listinfo/scikit-learn
    or, via email, send a message with subject or body 'help' to
    [email protected]
    <mailto:[email protected]>

    You can reach the person managing the list at
    [email protected] <mailto:[email protected]>

    When replying, please edit your Subject line so it is more specific
    than "Re: Contents of scikit-learn digest..."


    Today's Topics:

       1. Re: Adding BM25 relevance function (Pavel Soriano)
       2. Re: The culture of commit squashing (Andreas Mueller)
       3. Re: The culture of commit squashing (Tom DLT)


    ----------------------------------------------------------------------

    Message: 1
    Date: Tue, 14 Jun 2016 16:11:10 +0000
    From: Pavel Soriano <[email protected]
    <mailto:[email protected]>>
    To: Scikit-learn user and developer mailing list
            <[email protected] <mailto:[email protected]>>
    Subject: Re: [scikit-learn] Adding BM25 relevance function
    Message-ID:
<can0wwk93r2aw9no65cgicw5hqg7-ofyvzamjqpxpegtxmsq...@mail.gmail.com <mailto:can0wwk93r2aw9no65cgicw5hqg7-ofyvzamjqpxpegtxmsq...@mail.gmail.com>>
    Content-Type: text/plain; charset="utf-8"

    Hey,

    Good thing that you are trying to finish this.

    Well, I looked into my old notes, and the Delta tf-idf comes from
    the "Delta
    TFIDF: An Improved Feature Space for Sentiment Analysis"
    <http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf> paper.
    I guess
    it is not very popular and apparently it has a drawback: it does
    not take
    into account the number of times a word occurs in each document while
    calculating the distribution amongst classes. At least that is
    what I wrote
    on my notes...

    As for the delta idf... If it helps, I can look into my old code
    cause I do
    not know what I was talking about. I guess it has to do somehow
    with the
    paper cited before.

    Cheers,

    Pavel Soriano




    On Tue, Jun 14, 2016 at 5:49 PM Basil Beirouti
    <[email protected] <mailto:[email protected]>>
    wrote:

    > Hi Joel,
    >
    > Thanks for your response and for digging up that archived
    thread, it gives
    > me a lot of clarity.
    >
    > I see your point about BM25, but I think in most cases where
    TFIDF makes
    > sense, BM25 makes sense as well, but it could be "overkill".
    >
    > Consider that TFIDF does not produce normalized results either
    >
    
<http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py>,
    > If BM25 requires dimensionality reduction (eg. using LSA) , so
    too would
    > TFIDF. The term-document matrix is the same size no matter which
    weighting
    > scheme is used. The only difference is that BM25 produces better
    results
    > when the corpus is large enough that the term frequency in a
    document, and
    > the document frequency in the corpus, can vary considerably
    across a broad
    > range of values.Maybe you could even say TFIDF and BM25 are the same
    > equation except BM25 has a few additional hyperparameters (b and k).
    >
    > So is the advantage that BM25 provides for large diverse corpora
    with it?
    > or is it marginal? Perhaps you can point me to some more
    examples where
    > TFIDF is used (in supervised setting preferably) and I can plug
    in BM25 in
    > place of TFIDF and see how it compares. Here are some I found:
    >
    >
    >
    
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
    > *(supervised)*
    >
    >
    
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py
    > (*unsupervised)*
    >
    > Thank you!
    > Basil
    >
    > PS: By the way, I'm not familiar with the delta-idf transform
    that Pavel
    > mentions in the archive you linked, I'll have to delve deeper
    into that. I
    > agree with the response to Pavel that he should be putting it in
    a separate
    > class, not adding on to the TFIDF. I think it would take me
    about 6-8 weeks
    > to adapt my code to the fit transform model and submit a pull
    request.
    >
    >
    >
    >
    >
    >
    > _______________________________________________
    > scikit-learn mailing list
    > [email protected] <mailto:[email protected]>
    > https://mail.python.org/mailman/listinfo/scikit-learn
    >
    --
    Pavel SORIANO

    PhD Student
    ERIC Laboratory
    Universit? de Lyon
    -------------- next part --------------
    An HTML attachment was scrubbed...
    URL:
    
<http://mail.python.org/pipermail/scikit-learn/attachments/20160614/cbe49979/attachment-0001.html>

    ------------------------------

    Message: 2
    Date: Tue, 14 Jun 2016 12:13:29 -0400
    From: Andreas Mueller <[email protected] <mailto:[email protected]>>
    To: Scikit-learn user and developer mailing list
            <[email protected] <mailto:[email protected]>>
    Subject: Re: [scikit-learn] The culture of commit squashing
    Message-ID: <[email protected]
    <mailto:[email protected]>>
    Content-Type: text/plain; charset="windows-1252"; Format="flowed"

    I'm +1 for using the button when appropriate.
    I think it should be up to the merging person to make a call whether a
    squash is a better
    logical unit than all the commits.
    I would set like a soft limit at ~5 commits or something. If your
    PR has
    more than 5 separate
    big logical units, it's probably too big.

    The button is enabled in the settings but I can't see it.
    Am I being stupid?

    On 06/14/2016 06:58 AM, Joel Nothman wrote:
    > Sounds good to me. Thank goodness someone reads the documentation!
    >
    > On 14 June 2016 at 19:51, Alexandre Gramfort
    > <[email protected]
    <mailto:[email protected]>
    > <mailto:[email protected]
    <mailto:[email protected]>>> wrote:
    >
    >     > We could stop squashing during development, and use the
    new Squash-and-Merge
    >     > button on GitHub.
    >     > What do you think?
    >
    >     +1
    >
    >     the reason I see for squashing during dev is to avoid
    killing the
    >     browser when reviewing. It really rarely happens though.
    >
    >     A
    >  _______________________________________________
    >     scikit-learn mailing list
    > [email protected] <mailto:[email protected]>
    <mailto:[email protected] <mailto:[email protected]>>
    > https://mail.python.org/mailman/listinfo/scikit-learn
    >
    >
    >
    >
    > _______________________________________________
    > scikit-learn mailing list
    > [email protected] <mailto:[email protected]>
    > https://mail.python.org/mailman/listinfo/scikit-learn

    -------------- next part --------------
    An HTML attachment was scrubbed...
    URL:
    
<http://mail.python.org/pipermail/scikit-learn/attachments/20160614/135d4c27/attachment-0001.html>

    ------------------------------

    Message: 3
    Date: Tue, 14 Jun 2016 18:40:39 +0200
    From: Tom DLT <[email protected]
    <mailto:[email protected]>>
    To: Scikit-learn user and developer mailing list
            <[email protected] <mailto:[email protected]>>
    Subject: Re: [scikit-learn] The culture of commit squashing
    Message-ID:
<CAGKmC=sRMbwo1Pjm=ph3r6oqsmvzuzdbmjvj09yjwkk0+yq...@mail.gmail.com <mailto:ph3r6oqsmvzuzdbmjvj09yjwkk0%[email protected]>>
    Content-Type: text/plain; charset="utf-8"

    @Andreas
    It's a bit hidden: You need to click on "Merge pull-request", then
    do *not*
    click on "Confirm merge", but on the small arrow to the right, and
    select
    "Squash and merge".

    2016-06-14 18:13 GMT+02:00 Andreas Mueller <[email protected]
    <mailto:[email protected]>>:

    > I'm +1 for using the button when appropriate.
    > I think it should be up to the merging person to make a call
    whether a
    > squash is a better
    > logical unit than all the commits.
    > I would set like a soft limit at ~5 commits or something. If
    your PR has
    > more than 5 separate
    > big logical units, it's probably too big.
    >
    > The button is enabled in the settings but I can't see it.
    > Am I being stupid?
    >
    >
    > On 06/14/2016 06:58 AM, Joel Nothman wrote:
    >
    > Sounds good to me. Thank goodness someone reads the documentation!
    >
    > On 14 June 2016 at 19:51, Alexandre Gramfort <
    > [email protected]
    <mailto:[email protected]>> wrote:
    >
    >> > We could stop squashing during development, and use the new
    >> Squash-and-Merge
    >> > button on GitHub.
    >> > What do you think?
    >>
    >> +1
    >>
    >> the reason I see for squashing during dev is to avoid killing the
    >> browser when reviewing. It really rarely happens though.
    >>
    >> A
    >> _______________________________________________
    >> scikit-learn mailing list
    >> [email protected] <mailto:[email protected]>
    >> https://mail.python.org/mailman/listinfo/scikit-learn
    >>
    >
    >
    >
    > _______________________________________________
    > scikit-learn mailing
    
[email protected]https://mail.python.org/mailman/listinfo/scikit-learn
    <http://mail.python.org/mailman/listinfo/scikit-learn>
    >
    >
    >
    > _______________________________________________
    > scikit-learn mailing list
    > [email protected] <mailto:[email protected]>
    > https://mail.python.org/mailman/listinfo/scikit-learn
    >
    >
    -------------- next part --------------
    An HTML attachment was scrubbed...
    URL:
    
<http://mail.python.org/pipermail/scikit-learn/attachments/20160614/511d2a1d/attachment.html>

    ------------------------------

    Subject: Digest Footer

    _______________________________________________
    scikit-learn mailing list
    [email protected] <mailto:[email protected]>
    https://mail.python.org/mailman/listinfo/scikit-learn


    ------------------------------

    End of scikit-learn Digest, Vol 3, Issue 27
    *******************************************




_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn

Re: [scikit-learn] adding BM25 relevance function

Reply via email to