Hi
Almost there (I hope), but not quite:
I put my code up at https://gist.github.com/balamuru/4726232 for
readability. It reads a directory of text files in chunks of 5 and
returns each chunk as a dictionary (key = filename, value = text contents).
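
For context, here is a stripped-down sketch of what the loader does (the
gist has the real code; the names here are illustrative, not necessarily
the gist's):

import os

def iter_file_batches(top_directory, batch_size=5):
    """Yield {filename: text} dicts, batch_size files at a time."""
    batch = {}
    for root, dirs, files in os.walk(top_directory):
        for fname in files:
            if not fname.endswith('.txt'):
                continue
            batch[fname] = open(os.path.join(root, fname)).read()
            if len(batch) == batch_size:
                yield batch
                batch = {}
    if batch:
        yield batch  # final, possibly smaller, batch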

I wanted to perform a clustering operation (haven't written that part yet),
but from my output, it looks like I'm not incrementally updating the
vectorizer. From your previous response, were you thinking I was trying to
classify the output into predetermined categories?

See my 2 questions in the code, i.e.:
    # Question 1: I don't know the class information, because this is an
unsupervised learning (clustering) operation. Hence I can't perform a
partial_fit.
    # Question 2: With respect to Question 1, what should I be passing
into the clustering algorithm? I would first have to incrementally
accumulate data in the vectorizer.
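
To make Question 2 concrete, this is the kind of loop I have in mind (a
sketch only, assuming MiniBatchKMeans since it exposes partial_fit, and
reusing the iter_file_batches reader sketched above):

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=10000, stop_words='english',
                               non_negative=False, norm='l2')
km = MiniBatchKMeans(n_clusters=10)

for batch in iter_file_batches('corpus_dir'):  # {filename: text} dicts
    X = vectorizer.transform(list(batch.values()))
    # hashing is stateless, so transform alone is enough; the clusterer,
    # not the vectorizer, is what accumulates state across batches
    km.partial_fit(X)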

The output is:
.
.
## counts: (10, 10000)
## counts: (10, 10000)
## counts: (10, 10000)
## counts: (10, 10000)
## counts: (5, 10000)
HashingVectorizer(analyzer=word, binary=False, charset=utf-8,
         charset_error=strict, dtype=<type 'numpy.float64'>, input=content,
         lowercase=True, n_features=10000, ngram_range=(1, 1),
         non_negative=False, norm=l2, preprocessor=None,
         stop_words=english, strip_accents=None,
         token_pattern=(?u)\b\w\w+\b, tokenizer=None)

Thanks
Vinay




On Wed, Feb 6, 2013 at 2:40 AM, Olivier Grisel <[email protected]> wrote:

> 2013/2/6 Vinay B, <[email protected]>:
> >
> > Hi Olivier,
> > Looking at the hashing vectorizer
> > (http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)
> > and how it is used, for example, in
> > http://scikit-learn.org/stable/auto_examples/document_clustering.html#example-document-clustering-py
> >
> > I'm trying to understand how this could be used scalably for large
> > datasets. From the example, as well as the Stack Overflow example, a
> > collection of documents will need to be passed to the vectorizer, e.g.:
> >
> > vectorizer = HashingVectorizer(n_features=opts.n_features,
> >                                stop_words='english',
> >                                non_negative=False, norm='l2',
> >                                binary=False)
> > X = vectorizer.fit_transform(dataset.data)
> >
> >
> >
> > 1. How can this scale if the number of documents is large? In the
> > example above, the entire dataset is passed to the vectorizer.
>
> Exactly by doing what I explained in the SO answer: by reading the
> content incrementally and updating the model in batches.
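>
> For instance, a small helper for grouping any iterable into fixed-size
> batches might look like this (a sketch, nothing sklearn-specific):
>
> from itertools import islice
>
> def iter_batches(iterable, batch_size=100):
>     """Yield lists of up to batch_size items from iterable."""
>     iterator = iter(iterable)
>     while True:
>         batch = list(islice(iterator, batch_size))
>         if not batch:
>             break
>         yield batch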
>
> > Regardless, could you kindly provide a simple example of how to read
> > files from a directory into scikit-learn? I dug through the
> > twenty_newsgroups.py (download_20newsgroup method) code and got lost,
> > being a Python newbie.
> >
> > For example, iterate through a directory:
> >
> > import os
> >
> > def iter_documents(top_directory):
> >     """Iterate over all documents, yielding one document
> >     (= one big string of text) at a time."""
> >     for root, dirs, files in os.walk(top_directory):
> >         for fname in filter(lambda f: f.endswith('.txt'), files):
> >             # read the entire document as one big string
> >             document = open(os.path.join(root, fname)).read()
> >             yield document
>
> Don't look at this dataset loader; it is made far too complicated by
> the fact that we want to store the data very efficiently in a
> compressed archive.
>
> Reading the text content of a collection of files is regular Python
> coding, nothing specific to sklearn. Just read the file contents in
> small batches, as lists of string documents that fit in memory.
>
> If you have few files, you can fetch all their filenames into a list
> ahead of time and then:
>
> - read the content of a batch of those files (for instance 100 files
> at a time) and put it in a temporary list,
> - feed the list of text contents of this batch to the
> vectorizer.transform method (no need to fit the HashingVectorizer),
> - read the class info for the individual files of this batch
> (for instance if it's encoded in the filename or in the directory
> name) and encode them as integers (e.g. 0 for "spam", 1 for "ham"),
> - pass the vectorized content (a sparse matrix) + the label integers
> for the current batch to the partial_fit method of the classifier,
> - iterate to the next batch of files (see the sketch below).
>
> Beware that the files should be ordered so that each batch contains a
> representative proportion of each class (e.g. both positive and
> negative examples for a binary classification problem). This can be
> done by shuffling the filename list ahead of time.
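>
> A rough sketch of that whole loop, using the iter_batches helper above
> (SGDClassifier is just one example of a classifier with a partial_fit
> method, and the filename-based labeling is purely illustrative):
>
> import os, random
> from sklearn.linear_model import SGDClassifier
> from sklearn.feature_extraction.text import HashingVectorizer
>
> filenames = [os.path.join(root, f)
>              for root, dirs, files in os.walk('corpus_dir')
>              for f in files if f.endswith('.txt')]
> random.shuffle(filenames)  # mix the classes across batches
>
> vectorizer = HashingVectorizer(n_features=10000, stop_words='english')
> clf = SGDClassifier()
> all_classes = [0, 1]  # partial_fit needs all classes on the first call
>
> for batch in iter_batches(filenames, batch_size=100):
>     texts = [open(f).read() for f in batch]
>     X = vectorizer.transform(texts)  # sparse matrix, no fitting needed
>     # label from the filename: 0 for "spam", 1 for "ham"
>     y = [0 if 'spam' in os.path.basename(f) else 1 for f in batch]
>     clf.partial_fit(X, y, classes=all_classes)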
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>