Hi all,
I am trying to do tfidf/lsa on pre-tokenized data (MeSH tags for any
biology folks out there) and want to skip tokenization, since
pre-processing has already handled it.
Unfortunately I am having trouble following the 'tips and tricks' in the doc:
Some tips and tricks:
If documents are pre-tokenized by an external package, then store them in
files (or strings) with the tokens separated by whitespace and pass
analyzer=str.split
This won't work for me because my tokens (MeSH tags) can be one word, or a
phrase.
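To illustrate the problem, here is a minimal sketch in plain Python. The tag strings are made-up examples for illustration, not taken from my data:

```python
# Hypothetical MeSH-style tags; some are single words, some are phrases.
tags = ["Breast Neoplasms", "Humans", "Mice, Inbred C57BL"]

# Storing the tags separated by whitespace and splitting with str.split
# breaks every multi-word tag into separate tokens.
whitespace_doc = " ".join(tags)
whitespace_tokens = whitespace_doc.split()

# Storing them with a delimiter that never occurs inside a tag keeps
# each phrase intact when splitting on that delimiter.
bar_doc = "|".join(tags)
bar_tokens = bar_doc.split("|")

print(whitespace_tokens)
print(bar_tokens)
```

The whitespace version loses the tag boundaries, while the bar-delimited version returns the original tags unchanged.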
My attempt at a workaround is to save the set of tokens for each
sample as a bar-delimited string and use:

    from sklearn.feature_extraction.text import TfidfVectorizer

    def pre_tokenizer(doc):
        # Split one bar-delimited document into its MeSH tags.
        return doc.split("|")

    tfidf = TfidfVectorizer(tokenizer=pre_tokenizer)
    tfidf.fit(content)

where content is a list of bar-delimited strings.
If anyone has any recommendations for a better way, or how to fix my broken
way I would really appreciate it!
Traceback:
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 1292, in transform
    X = super(TfidfVectorizer, self).transform(raw_documents)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 855, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 741, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 233, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 199, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
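Reading the traceback, the failure is in the preprocessing step (x.lower()), which runs before the tokenizer and expects each document to be a string; the message suggests at least one document reached the vectorizer as a list. One direction that might sidestep this is sketched below. It assumes that a callable passed as analyzer replaces preprocessing and tokenization entirely, which is my reading of the docs rather than something I have verified on my data. The function name split_on_bar and the sample tags are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def split_on_bar(doc):
    # Hypothetical helper: one bar-delimited string in, list of tags out.
    return doc.split("|")


# Made-up sample documents; each one is a single bar-delimited string.
content = ["Breast Neoplasms|Humans", "Humans|Mice"]

# Assumption: with a callable analyzer, no lowercasing, accent stripping
# or regex tokenization is applied; the callable's output is used as-is.
tfidf = TfidfVectorizer(analyzer=split_on_bar)
X = tfidf.fit_transform(content)

vocab = sorted(tfidf.vocabulary_)
print(vocab)
print(X.shape)
```

If the documents were already lists of tags rather than strings, the same idea would presumably work with an analyzer that simply returns its input unchanged.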
--
Patrick Short
------------------------------
University of North Carolina at Chapel Hill, 2014
Applied Mathematics and Quantitative Biology
pjshor...@gmail.com | 919-455-7045 C
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general