[Scikit-learn-general] pre-tokenized data (not splitting on white space)

Patrick Short Fri, 12 Sep 2014 14:55:13 -0700

Hi all,

I am trying to run tf-idf/LSA on pre-tokenized data (MeSH tags, for any
biology folks out there) and want to skip tokenization, since
pre-processing has already done it.


Unfortunately, I am having trouble following the 'tips and tricks' in the docs:

Some tips and tricks:
If documents are pre-tokenized by an external package, then store them in
files (or strings) with the tokens separated by whitespace and pass
analyzer=str.split

This won't work for me because my tokens (MeSH tags) can be one word, or a
phrase.
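
To illustrate the problem, here is a minimal sketch with made-up tags (the
tag names are only examples, not my real data). Joining the tags with spaces
and splitting on whitespace breaks a multi-word tag into separate tokens:

```python
# Hypothetical MeSH-style tags for one sample; the first tag is a phrase.
tags = ["Breast Neoplasms", "Humans"]

# Store the tags whitespace-separated, as the tip in the docs suggests.
doc = " ".join(tags)

# str.split cannot tell where one tag ends and the next begins,
# so the two-word tag comes back as two tokens.
tokens = doc.split()
print(tokens)
```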

My attempted workaround is to save the set of tokens for each sample as a
bar-delimited string and use:

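from sklearn.feature_extraction.text import TfidfVectorizer

def pre_tokenizer(doc):
    # Split one bar-delimited document into its tags
    return doc.split("|")

tfidf = TfidfVectorizer(tokenizer=pre_tokenizer)
tfidf.fit(content)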

where content is a list of bar-delimited strings.

If anyone can recommend a better way, or tell me how to fix my broken
approach, I would really appreciate it!

traceback:

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 1292, in transform
    X = super(TfidfVectorizer, self).transform(raw_documents)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 855, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 741, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 233, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 199, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
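
The AttributeError suggests that at least one document reached the
lowercasing preprocessor as a list rather than a string. One possible way
around both issues is to pass a callable as analyzer, in which case the
vectorizer calls it on each raw document and uses its output as the tokens,
so the default lowercasing and tokenization do not run. The following is
only a sketch under that assumption: the sample data and the helper names
(split_on_bar, identity) are hypothetical, and it has not been checked
against the scikit-learn version in the traceback.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def split_on_bar(doc):
    """Split one bar-delimited document into tags, keeping phrases whole."""
    return [tag.strip() for tag in doc.split("|") if tag.strip()]


def identity(tokens):
    """Return an already-tokenized document (a list of tags) unchanged."""
    return tokens


# Hypothetical sample data: one bar-delimited string per sample.
content = [
    "Breast Neoplasms|Humans|Female",
    "Humans|Gene Expression Regulation",
]

# Variant 1: documents are bar-delimited strings.
tfidf = TfidfVectorizer(analyzer=split_on_bar)
X = tfidf.fit_transform(content)
vocab = sorted(tfidf.vocabulary_)

# Variant 2: documents are already lists of tags.
token_lists = [split_on_bar(doc) for doc in content]
tfidf_lists = TfidfVectorizer(analyzer=identity)
X_lists = tfidf_lists.fit_transform(token_lists)
vocab_lists = sorted(tfidf_lists.vocabulary_)

print(vocab)
print(X.shape)
```

Because the analyzer output is used as-is, tags keep their original case
and multi-word tags stay single features. The same vectorizer has to be
used for both fit and transform so that documents are split the same way.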



-- 
Patrick Short
------------------------------

University of North Carolina at Chapel Hill, 2014

Applied Mathematics and Quantitative Biology

pjshor...@gmail.com | 919-455-7045 C
