Hi Olivier,
Looking at the hashing vectorizer
(http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)
and how it is used, for example, in the document clustering example
(http://scikit-learn.org/stable/auto_examples/document_clustering.html#example-document-clustering-py),
I'm trying to understand how it could be used scalably for large datasets.
From that example, as well as the Stack Overflow example, a collection of
documents needs to be passed to the vectorizer.

e.g.:

vectorizer = HashingVectorizer(n_features=opts.n_features,
                               stop_words='english',
                               non_negative=False, norm='l2',
                               binary=False)
X = vectorizer.fit_transform(dataset.data)



1. How can this scale if the number of documents is large? In the example
above, the entire dataset is passed to the vectorizer.
2. From the HashingVectorizer documentation:

> Convert a collection of text documents to a matrix of token occurrences
> ====================> my text documents will be in a directory tree
> It turns a collection of text documents into a scipy.sparse matrix holding
> token occurrence counts (or binary occurrence information), possibly
> normalized as token frequencies if norm='l1' or projected on the euclidean
> unit sphere if norm='l2'.
> This text vectorizer implementation uses the hashing trick to find the
> token string name to feature integer index mapping.
> This strategy has several advantages:
>
>    - it is very low memory scalable to large datasets as there is no
>    need to store a vocabulary dictionary in memory
>
>    - it is fast to pickle and un-pickle as it holds no state besides
>    the constructor parameters
>
>    - it can be used in a streaming (partial fit) or parallel pipeline as
>    there is no state computed during fit. =============> does this mean
>    that we can feed the docs one at a time as I iterate across my document
>    tree? (see the sketch just after this quote)
>
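If I'm reading that last point correctly, the absence of fitted state should mean
I can call transform() on one document (or a small batch) at a time instead of
the whole corpus at once. Here is a rough sketch of what I have in mind (the
n_features value is just one I picked, and I'm assuming transform() accepts any
small list of strings; please correct me if the pattern is wrong):

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2 ** 18, stop_words='english',
                               non_negative=False, norm='l2')

# My assumption: no fit() / vocabulary is needed, so each document can be
# hashed independently as it is read.
for doc in ["first document text", "second document text"]:
    X = vectorizer.transform([doc])   # 1 x n_features sparse row
    print(X.shape)
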
Regardless, could you kindly provide a simple example of how to read in
files from a directory into scikit-learn? I dug through the twenty_newsgroups.py
code (the download_20newsgroup method) and got lost, being a Python newbie.

For example, to iterate through a directory:

import os

def iter_documents(top_directory):
    """Iterate over all documents, yielding one document (as one big utf8 string) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file_name in filter(lambda f: f.endswith('.txt'), files):
            # print(file_name)
            # read the entire document, as one big string
            document = open(os.path.join(root, file_name)).read()
            ##### <=== HOW CAN I HANDLE THE INDIVIDUAL FILE STRING SO IT CAN BE
            #####      CONSUMED BY THE HASHING VECTORIZER? ===> #####
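To make the question concrete, here is roughly what I imagine the full loop
looking like (just a sketch under my assumptions above; the directory path and
the batch size of 1000 are arbitrary placeholders):

import os
from sklearn.feature_extraction.text import HashingVectorizer

def iter_documents(top_directory):
    """Yield each .txt file under top_directory as one big string."""
    for root, dirs, files in os.walk(top_directory):
        for file_name in files:
            if file_name.endswith('.txt'):
                with open(os.path.join(root, file_name)) as f:
                    yield f.read()

vectorizer = HashingVectorizer(n_features=2 ** 18, stop_words='english')

batch = []
for document in iter_documents('/path/to/my/corpus'):
    batch.append(document)
    if len(batch) == 1000:                  # arbitrary mini-batch size
        X = vectorizer.transform(batch)     # 1000 x n_features sparse matrix
        # ... hand X to something with partial_fit here?
        batch = []
if batch:                                   # left-over documents
    X = vectorizer.transform(batch)

Is that the intended usage, or is there a more idiomatic way to stream
documents into the vectorizer?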

Please correct me if I somehow misunderstand.
Thanks

Re: [Scikit-learn-general] Scikit-learn scalability options?
<http://sourceforge.net/mailarchive/message.php?msg_id=30446769>
> From: Olivier Grisel <olivier.grisel@en...> - 2013-02-05 16:56
> Please have a look at the following SO answer:
>
> http://stackoverflow.com/questions/12460077/possibility-to-apply-online-algorithms-on-big-data-files-with-sklearn/12460918#12460918
> Note that large scale learning also requires large scale feature
> extraction. The recently released 0.14 version includes the
> sklearn.feature_extraction.text.HashingVectorizer for text and
> sklearn.feature_extraction.FeatureHasher for streams of categorical
> data (e.g. a list of python dicts).
> Have a look at the documentation here:
> http://scikit-learn.org/dev/modules/feature_extraction.html
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel


On Tue, Feb 5, 2013 at 7:48 AM, Vinay B, <[email protected]> wrote:
> Hi,
> From my newbie experiments last week, it appears that scikit loads all
> documents into memory (for classification (training & testing) and
> clustering). This approach might not scale to the millions of (text)
> docs that I want to process.
>
> 1. Is there a recommended way to deal with large datasets? Examples?
> 2. I've also been looking at gensim, which offers a memory-efficient
> way to ingest large datasets. Is there a way to:
> a. use the same approach with scikit, AND / OR
> b. use the gensim models with scikit's clustering and classification
> capabilities?
>
>
> Thanks for all the help so far. It has been very useful.