2013/2/6 Vinay B, <vybe3...@gmail.com>:
>
> Hi Olivier,
> Looking at the hashing vectorizer
> (http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)
> and how it is used, for example, in
> http://scikit-learn.org/stable/auto_examples/document_clustering.html#example-document-clustering-py
>
> I'm trying to understand how this could be used scalably for large
> datasets. From the example, as well as the stack overflow example, a
> collection of documents will need to be passed to the vectorizer.
>
> e.g.
>
>     vectorizer = HashingVectorizer(n_features=opts.n_features,
>                                    stop_words='english',
>                                    non_negative=False, norm='l2',
>                                    binary=False)
>     X = vectorizer.fit_transform(dataset.data)
>
> 1. How can this scale if the number of documents is large? In the example
> above, the entire dataset is passed to the vectorizer.
Exactly by doing what I explained in the SO answer: by reading the content
incrementally and updating the model in batches.

> Regardless, could you kindly provide a simple example of how to read in
> files from a directory to scikit. I dug through the twenty_newsgroups.py
> (download_20newsgroup method) code and got lost, being a python newbie.
>
> For example, iterate through a directory
>
>     def iter_documents(top_directory):
>         """Iterate over all documents, yielding a document (=list of utf8
>         tokens) at a time."""
>         for root, dirs, files in os.walk(top_directory):
>             for file in filter(lambda file: file.endswith('.txt'), files):
>                 # print file
>                 document = open(os.path.join(root, file)).read()
>                 # read the entire document, as one big string

Don't look at this dataset loader, it is made far too complicated by the fact
that we want to store the data very efficiently in a compressed archive.
Reading the text content of a collection of files is regular Python coding and
nothing specific to sklearn. Just read the file contents in small batches, as
lists of string documents that fit in memory.

If you have few enough files, you can fetch all their filenames in a list
ahead of time and then:

- read the content of a batch of files (for instance 100 files at a time) and
  put it in a temporary list,
- feed the list of text contents of this batch to the vectorizer.transform
  method (no need to fit the HashingVectorizer),
- read the class info for the individual files of this batch (for instance if
  it's encoded in the filename or in the directory name) and encode it as
  integers (e.g. 0 for "spam", 1 for "ham"),
- pass the vectorized content (a sparse matrix) + the label integers for the
  current batch to the partial_fit method of the classifier,
- iterate to the next batch of files (see the sketch in the postscript below).

Beware that the files should be ordered such that each batch contains a
representative proportion of each class (e.g. both positive and negative
examples for a binary classification problem). This can be done by shuffling
the filename list ahead of time.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
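P.S. Putting these steps together, here is a rough, untested sketch of the
out-of-core loop. The directory layout (one subfolder per class), the
"spam"/"ham" label encoding, the batch size and the SGDClassifier settings
are just placeholder assumptions to illustrate the pattern, not a reference
implementation:

    import os
    import random

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    def iter_filenames(top_directory):
        """Yield the paths of all .txt files under top_directory."""
        for root, dirs, files in os.walk(top_directory):
            for fname in files:
                if fname.endswith('.txt'):
                    yield os.path.join(root, fname)

    def read_batch(paths):
        """Read a small batch of files into memory as (texts, labels)."""
        texts, labels = [], []
        for path in paths:
            texts.append(open(path).read())
            # Assumed layout: the parent directory name is the class name,
            # e.g. corpus/spam/0001.txt -> label 0, corpus/ham/0042.txt -> 1
            class_name = os.path.basename(os.path.dirname(path))
            labels.append(0 if class_name == 'spam' else 1)
        return texts, labels

    filenames = list(iter_filenames('corpus'))
    random.shuffle(filenames)  # mix the classes across batches

    vectorizer = HashingVectorizer(n_features=2 ** 18, stop_words='english',
                                   non_negative=False, norm='l2')
    clf = SGDClassifier(loss='log')

    batch_size = 100
    for start in range(0, len(filenames), batch_size):
        texts, labels = read_batch(filenames[start:start + batch_size])
        X = vectorizer.transform(texts)  # stateless hashing, no fit needed
        clf.partial_fit(X, labels, classes=[0, 1])

Any other classifier exposing partial_fit (Perceptron, or MultinomialNB with
non_negative=True in the vectorizer, ...) could be used in place of
SGDClassifier.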