I am trying the example at
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
(6.2.2.3. Common Vectorizer usage).
I ran:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print vectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
print X
analyze = vectorizer.build_analyzer()
but I get
CountVectorizer(analyzer=WordNGramAnalyzer(charset=utf-8, max_n=1, min_n=1,
preprocessor=RomanPreprocessor(),
stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over',
'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify',
'where', 'mill', 'only', 'find', 'before', 'one', ...er', 'without', 'so',
'five', 'the', 'first', 'whereas', 'once']),
token_pattern=\b\w\w+\b),
analyzer__charset=utf-8, analyzer__max_n=1, analyzer__min_n=1,
analyzer__preprocessor=RomanPreprocessor(),
analyzer__stop_words=set(['all', 'six', 'less', 'being', 'indeed',
'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves',
'fify', 'where', 'mill', 'only', 'find', 'before', 'one',
'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had',
'enough', 'should', 'to', 'must...'amoungst', 'yours', 'their', 'rather',
'without', 'so', 'five', 'the', 'first', 'whereas', 'once']),
analyzer__token_pattern=\b\w\w+\b, dtype=<type 'long'>, max_df=1.0,
max_features=None, vocabulary=None)
(0, 1) 1
(1, 0) 2
(1, 1) 1
(3, 1) 1
Traceback (most recent call last):
  File "C:\Python27\probando6.2.2.py", line 13, in <module>
    analyze = vectorizer.build_analyzer()
AttributeError: 'CountVectorizer' object has no attribute 'build_analyzer'
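
One way to narrow this down, offered only as a sketch: the repr printed above mentions WordNGramAnalyzer, which looks like the older interface rather than the one in the current documentation, so it may be that the interpreter is importing the scikit-learn 0.9 install mentioned in the earlier message. That is an inference from the output, not something confirmed in this thread. The helpers below (describe_module and has_method are hypothetical names, not part of scikit-learn) report which version and file a module is loaded from and whether an object has a given method. The code is written to run on both Python 2.7 and Python 3.

```python
import importlib


def describe_module(name):
    """Return the version and file location of an importable module."""
    module = importlib.import_module(name)
    return {
        "name": name,
        # Not every module defines __version__, so fall back to a placeholder.
        "version": getattr(module, "__version__", "unknown"),
        "file": getattr(module, "__file__", "unknown"),
    }


def has_method(obj, method_name):
    """True if obj exposes a callable attribute called method_name."""
    return callable(getattr(obj, method_name, None))


if __name__ == "__main__":
    try:
        # Shows which scikit-learn install this interpreter actually picks up.
        print(describe_module("sklearn"))
        from sklearn.feature_extraction.text import CountVectorizer
        # False here would match the AttributeError in the traceback above.
        print(has_method(CountVectorizer(), "build_analyzer"))
    except ImportError:
        print("scikit-learn is not importable from this interpreter")
```

If the reported version or file path points at an older install than expected, removing that install (or fixing the import path) would be the first thing to try.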
Prof. Dr. Andrés Soto
DES DACI
UNACAR
----- Forwarded Message -----
>From: Andres Soto <[email protected]>
>To: Robert Layton <[email protected]>;
>"[email protected]"
><[email protected]>
>Sent: Tuesday, August 7, 2012 11:30 AM
>Subject: [Scikit-learn-general] scikit-learn-0.11
>
>
>according to
>http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
> (6.2.2.3. Common Vectorizer usage),
>I did:
>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>> vectorizer = CountVectorizer()
>but I get
>
>>>> vectorizer
>CountVectorizer(analyzer=WordNGramAnalyzer(charset='utf-8', max_n=1, min_n=1,
> preprocessor=RomanPreprocessor(),
> stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over',
>'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify',
>'where', 'mill', 'only', 'find', 'before', 'one'...without', 'so', 'five',
>'the', 'first', 'whereas', 'once']),
> token_pattern='\\b\\w\\w+\\b'),
> dtype=<type 'long'>, max_df=1.0, max_features=None,
> vocabulary=None)
>instead of
>
>>>> vectorizer
>CountVectorizer(analyzer='word', binary=False, charset='utf-8',
>        charset_error='strict', dtype=<type 'long'>, input='content',
>        lowercase=True, max_df=1.0, max_features=None, max_n=1, min_n=1,
>        preprocessor=None, stop_words=None, strip_accents=None,
>        token_pattern=u'\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
>which is what that web page shows.
>regards
>
>
>Prof. Dr. Andrés Soto
>DES DACI
>UNACAR
>
>
>
>>________________________________
>> From: Robert Layton <[email protected]>
>>To: Andres Soto <[email protected]>;
>>[email protected]
>>Sent: Monday, August 6, 2012 7:24 PM
>>Subject: Re: [Scikit-learn-general] scikit-learn-0.11 vs scikit-learn-0.9
>>
>>
>>On 7 August 2012 03:18, Andres Soto <[email protected]> wrote:
>>
>>>Hi
>>>I am using python-2.7.3, numpy-1.6.2-win32-superpack-python2.7,
>>>scipy-0.11.0rc1-win32-superpack-python2.7, scikit-learn-0.11.win32-py2.7
>>>I tried the following
>>>
>>>>>> train_set = ("The sky is blue.", "The sun is bright.")
>>>>>> test_set = ("The sun in the sky is bright.",
>>>        "We can see the shining sun, the bright sun.")
>>>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>>>> vectorizer = CountVectorizer()
>>>>>> print vectorizer
>>>CountVectorizer(analyzer=word, binary=False, charset=utf-8,
>>>        charset_error=strict, dtype=<type 'long'>, input=content,
>>>        lowercase=True, max_df=1.0, max_features=None, max_n=1, min_n=1,
>>>        preprocessor=None, stop_words=None, strip_accents=None,
>>>        token_pattern=\b\w\w+\b, tokenizer=None, vocabulary=None)
>>>>>> vectorizer.fit_transform(train_set)
>>><2x6 sparse matrix of type '<type 'numpy.int64'>'
>>>        with 8 stored elements in COOrdinate format>
>>>>>> print vectorizer.vocabulary
>>>
>>>Traceback (most recent call last):
>>>  File "<pyshell#6>", line 1, in <module>
>>>    print vectorizer.vocabulary
>>>AttributeError: 'CountVectorizer' object has no attribute 'vocabulary'
>>>>>>
>>>
>>>I tried to fix the parameters of CountVectorizer (analyzer =
>>>WordNGramAnalyzer, vocabulary = dict) but it didn’t work, so I decided to
>>>install sklearn 0.9, and that works. So everything is OK for now, but I
>>>would still like to know what is wrong with sklearn 0.11.
>>>Andrés Soto
>>>
>>>------------------------------------------------------------------------------
>>>Live Security Virtual Conference
>>>Exclusive live event will cover all the ways today's security and
>>>threat landscape has changed and how IT managers can respond. Discussions
>>>will include endpoint security, mobile security and the latest in malware
>>>threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>>_______________________________________________
>>>Scikit-learn-general mailing list
>>>[email protected]
>>>https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>Hi Andres,
>>
>>
>>Short answer: vocabulary_
>>>>> print vectorizer.vocabulary_
>>{u'blue': 0, u'sun': 4, u'is': 2, u'sky': 3, u'bright': 1, u'the': 5}
>>
>>
>>(You can view all the methods and attributes by using dir(vectorizer) )
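
The two hints above (the trailing-underscore attribute and dir()) can be sketched roughly as follows. This is an illustration under stated assumptions, not scikit-learn code: FakeVectorizer is a hypothetical stand-in that only mimics the fit() / vocabulary_ convention so the block runs without the library, and public_names and fitted_vocabulary are hypothetical helper names. The fallback to the plain `vocabulary` attribute assumes the older release stored the learned mapping there, as the report that 0.9 "works" suggests.

```python
def public_names(obj):
    """List attribute and method names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


def fitted_vocabulary(vectorizer):
    """Return the learned term-to-index mapping, trying the newer name first."""
    for attr in ("vocabulary_", "vocabulary"):
        value = getattr(vectorizer, attr, None)
        if value is not None:
            return value
    raise AttributeError("no fitted vocabulary found; was fit() called?")


class FakeVectorizer(object):
    """Hypothetical stand-in so the sketch runs without scikit-learn."""

    def fit(self, docs):
        # Crude tokenization for illustration only: split on whitespace,
        # lowercase, and drop trailing periods.
        terms = sorted({w.lower().strip(".") for d in docs for w in d.split()})
        # Learned state is stored under a trailing-underscore name.
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self


vec = FakeVectorizer().fit(("The sky is blue.", "The sun is bright."))
print(public_names(vec))  # ['fit', 'vocabulary_']
print(fitted_vocabulary(vec))
```

With a real CountVectorizer in place of the stand-in, the same two helpers should show `vocabulary_` among the public names once fit_transform() has been called.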
>>
>>
>>Longer answer:
>>The interface for this part of scikit-learn changed significantly
>>between those releases.
>>The changes make this section easier to use and maintain, which is why they
>>were made, despite breaking compatibility.
>>The updated documentation can be found
>>here: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
>>
>>
>>Hope that helps,
>>
>>
>>Robert
>>
>>
>>--
>>
>>Public key at: http://pgp.mit.edu/ Search for this email address and select
>>the key from "2011-08-19" (key id: 54BA8735)
>>
>>
>>
>>