Hi Olivier,
Looking at the hashing vectorizer
(http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)
and how it is used, for example, in
http://scikit-learn.org/stable/auto_examples/document_clustering.html#example-document-clustering-py
I'm trying to understand how it could be used scalably for large datasets.
From that example, as well as the Stack Overflow example, it looks like the
whole collection of documents has to be passed to the vectorizer at once,
e.g.
vectorizer = HashingVectorizer(n_features=opts.n_features,
                               stop_words='english',
                               non_negative=False, norm='l2',
                               binary=False)
X = vectorizer.fit_transform(dataset.data)
1. How can this scale if the number of documents is large? In the example
above, the entire dataset is passed to the vectorizer at once.
2. From the HashingVectorizer documentation:

> Convert a collection of text documents to a matrix of token occurrences
> ====================> my text documents will be in a directory tree
> It turns a collection of text documents into a scipy.sparse matrix holding
> token occurrence counts (or binary occurrence information), possibly
> normalized as token frequencies if norm='l1' or projected on the euclidean
> unit sphere if norm='l2'.
> This text vectorizer implementation uses the hashing trick to find the
> token string name to feature integer index mapping.
> This strategy has several advantages:
>
> - it is very low memory scalable to large datasets as there is no
>   need to store a vocabulary dictionary in memory
>
> - it is fast to pickle and un-pickle as it holds no state besides
>   the constructor parameters
>
> - it can be used in a streaming (partial fit) or parallel pipeline as
>   there is no state computed during fit. =============> does this mean that
>   we can feed the docs, one at a time, as I iterate across my document tree?
>   (See my rough sketch just below.)
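To make the question concrete, here is roughly what I imagine the "streaming"
usage would look like (completely untested, just my reading of the docs;
my_batches, the binary labels and SGDClassifier are only placeholders for
whatever iterator and incremental estimator I end up using):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 18, stop_words='english')
classifier = SGDClassifier()

# transform() holds no fitted state, so (if I understand correctly) each
# batch of raw text can be hashed independently of all the others.
for texts, labels in my_batches:  # my_batches is a placeholder for my own iterator
    X = vectorizer.transform(texts)
    classifier.partial_fit(X, labels, classes=[0, 1])  # labels assumed binary here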
Regardless, could you kindly provide a simple example of how to read in
files from a directory into scikit-learn? I dug through the twenty_newsgroups.py
code (the download_20newsgroup method) and got lost, being a Python newbie.
For example, iterating through a directory:

import os

def iter_documents(top_directory):
    """Iterate over all documents, yielding one document (one big utf8
    string) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: file.endswith('.txt'), files):
            # read the entire document, as one big string
            document = open(os.path.join(root, file)).read()
            ##### <=== HOW CAN I HANDLE THE INDIVIDUAL FILE STRING, SO
            ##### IT CAN BE CONSUMED BY THE HASHING VECTORIZER? ===> #####
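Here is my rough guess at how this might plug into the vectorizer (again
untested; it assumes iter_documents() above ends with "yield document", and
the batch size and corpus path are made up for illustration):

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(stop_words='english', norm='l2')

def iter_batches(top_directory, batch_size=1000):
    """Group the strings yielded by iter_documents() into small lists."""
    batch = []
    for document in iter_documents(top_directory):
        batch.append(document)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

for batch in iter_batches('/path/to/my/corpus'):  # path is just an example
    X = vectorizer.transform(batch)  # scipy.sparse matrix, one row per document
    # ... X could then be handed to partial_fit() of an incremental estimator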
Please correct me if I somehow misunderstand.
Thanks
Re: [Scikit-learn-general] Scikit-learn scalability options?
<http://sourceforge.net/mailarchive/message.php?msg_id=30446769>
> From: Olivier Grisel <olivier.grisel@en...> - 2013-02-05 16:56
> Please have a look at the following SO answer:
>
> http://stackoverflow.com/questions/12460077/possibility-to-apply-online-algorithms-on-big-data-files-with-sklearn/12460918#12460918
> Note that large scale learning also requires large scale feature
> extraction. The recently released 0.14 version includes the
> sklearn.feature_extraction.text.HashingVectorizer for text and
> sklearn.feature_extraction.FeatureHasher for streams of categorical
> data (e.g. a list of python dicts).
> Have a look at the documentation here:
> http://scikit-learn.org/dev/modules/feature_extraction.html
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
On Tue, Feb 5, 2013 at 7:48 AM, Vinay B, <[email protected]> wrote:
> Hi,
> From my newbie experiments last week, it appears that scikit-learn loads
> all documents into memory for both classification (training & testing)
> and clustering. This approach might not scale to the millions of (text)
> docs that I want to process.
>
> 1. Is there a recommended way to deal with large datasets? Examples?
> 2. I've also been looking at gensim, which offers a memory-efficient
> way to ingest large datasets. Is there a way to
>    a. use the same approach with scikit-learn, AND / OR
>    b. use the gensim models with scikit-learn's clustering and
>       classification capabilities?
>
>
> Thanks for all the help so far. It has been very useful.