2013/2/6 Vinay B, <vybe3...@gmail.com>:
>
> Hi Olivier,
> Looking at the hashing vectorizer
> (http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)
> and how it is used, for example, in
> http://scikit-learn.org/stable/auto_examples/document_clustering.html#example-document-clustering-py
>
> I'm trying to understand how this could be used scalably for large datasets.
> From the example, as well as the Stack Overflow example, a collection of
> documents needs to be passed to the vectorizer.
>
> e.g.
>
> vectorizer = HashingVectorizer(n_features=opts.n_features,
>                                        stop_words='english',
>                                        non_negative=False, norm='l2',
>                                        binary=False)
> X = vectorizer.fit_transform(dataset.data)
>
>
>
> 1. How can this scale if the number of documents is large? In the example
> above, the entire dataset is passed to the vectorizer.

Exactly by doing what I explained in the SO answer: by reading the
content incrementally and updating the model in batches.
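
Something along these lines (a minimal sketch; the iter_batches generator
and the class labels are made up for illustration):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier()
all_classes = [0, 1]  # the full set of classes must be known up front

for texts, labels in iter_batches():  # hypothetical generator of (list of texts, list of ints)
    X_batch = vectorizer.transform(texts)  # stateless hashing, hence no fit / vocabulary needed
    clf.partial_fit(X_batch, labels, classes=all_classes)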

> Regardless, could you kindly provide a simple example of how to read files
> from a directory into scikit-learn? I dug through the twenty_newsgroups.py
> (download_20newsgroup method) code and got lost, being a Python newbie.
>
> For example, iterating through a directory:
>
> def iter_documents(top_directory):
>     """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
>     for root, dirs, files in os.walk(top_directory):
>         for file in filter(lambda file: file.endswith('.txt'), files):
>             # print file
>             document = open(os.path.join(root, file)).read()  # read the entire document, as one big string

Don't look at this dataset loader; it is made far too complicated by the
fact that we want to store the data very efficiently in a compressed
archive.

Reading the text content of a collection of files is regular Python
coding and nothing specific to sklearn. Just read the file contents in
small batches, as lists of string documents that fit in memory.
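
For instance, a generator like the following would do (just a sketch;
the function name and the .txt filter are arbitrary choices):

import os

def iter_text_batches(top_directory, batch_size=100):
    """Yield lists of at most batch_size raw text documents read from .txt files."""
    batch = []
    for root, dirs, files in os.walk(top_directory):
        for filename in files:
            if not filename.endswith('.txt'):
                continue
            with open(os.path.join(root, filename)) as f:
                batch.append(f.read())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # the last, possibly smaller, batch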

If you don't have too many files, you can fetch all their filenames into
a list ahead of time and then (a complete sketch follows after the note
on class balance below):

- read the content of a batch of files (for instance 100 files at a
time) and put it in a temporary list,
- feed the list of text contents for this batch to the
vectorizer.transform method (no need to fit the HashingVectorizer),
- read the class info for the individual files of this batch
(for instance if it's encoded in the filename or in the directory
name) and encode the classes as integers (e.g. 0 for "spam", 1 for "ham"),
- pass the vectorized content (a sparse matrix) together with the integer
labels for the current batch to the partial_fit method of the classifier,
- iterate to the next batch of files.

Beware that the files should be ordered so that each batch contains a
representative proportion of each class (e.g. both positive and
negative examples for a binary classification problem).
This can be done by shuffling the filename list ahead of time.
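
Putting it all together, a sketch could look like this (the top
directory, the "spam" / "ham" parent directory names used as labels and
the batch size are just assumptions for the example):

import os
import random

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

top_directory = 'some_top_directory'  # placeholder path
filenames = []
for root, dirs, files in os.walk(top_directory):
    filenames.extend(os.path.join(root, f) for f in files if f.endswith('.txt'))

random.shuffle(filenames)  # make each batch a representative mix of the classes

vectorizer = HashingVectorizer(stop_words='english', non_negative=False, norm='l2')
clf = SGDClassifier()
all_classes = [0, 1]  # 0 for "spam", 1 for "ham"

batch_size = 100
for start in range(0, len(filenames), batch_size):
    batch_paths = filenames[start:start + batch_size]
    texts = [open(path).read() for path in batch_paths]
    # here the class is assumed to be the name of the parent directory
    labels = [0 if os.path.basename(os.path.dirname(path)) == 'spam' else 1
              for path in batch_paths]
    X_batch = vectorizer.transform(texts)  # sparse matrix, no fitting required
    clf.partial_fit(X_batch, labels, classes=all_classes)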

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
