Oh, I just figured it out: the number of features is the max value of term_id, not the number of unique terms. Sorry to disturb you ;)
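For anyone who hits the same thing: as far as I can tell, the loader sizes the matrix by the largest index it sees, so sparse term ids give a very wide X. Below is a rough sketch (plain Python, no sklearn needed) that counts both numbers for the three sample lines from my original message and rewrites the term ids as consecutive 1-based indices. The helper names parse_line and remap_indices are made up for illustration, and I am assuming every pair is a well-formed index:value.

```python
def parse_line(line):
    """Split one libsvm line into (label, [(index, value), ...])."""
    label, *pairs = line.split()
    return label, [tuple(p.split(":", 1)) for p in pairs]


def remap_indices(lines):
    """Rewrite term ids as consecutive 1-based indices (hypothetical helper)."""
    parsed = [parse_line(l) for l in lines if l.strip()]
    # Collect every distinct term id that occurs anywhere in the file.
    term_ids = sorted({int(i) for _, pairs in parsed for i, _ in pairs})
    # The mapping is monotonic, so ascending indices within a row stay ascending.
    mapping = {t: k for k, t in enumerate(term_ids, start=1)}
    out = []
    for label, pairs in parsed:
        cols = " ".join(f"{mapping[int(i)]}:{v}" for i, v in pairs)
        out.append(f"{label} {cols}")
    return out, mapping


sample = [
    "6284 576:1 884:1 2482:1 4279:1 5765:1 184552:1 661512:1 699842:1",
    "2259 1669:1 5711528:6",
    "2822 5765159:1",
]

compact, mapping = remap_indices(sample)
max_id = max(mapping)      # what drives the number of columns when loading as-is
n_unique = len(mapping)    # number of columns after remapping

print("max term_id:", max_id)
print("unique terms:", n_unique)
for row in compact:
    print(row)
```

Writing the remapped lines to a new file and loading that one should give a matrix whose width matches the number of unique terms. Keep the mapping around if you need to get back to the original term ids.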

Cheers

On Thu, Sep 8, 2016 at 8:40 PM, klo uo <[email protected]> wrote:
>
> ---------- Forwarded message ----------
> From: klo uo <[email protected]>
> Date: Thu, Sep 8, 2016 at 8:25 PM
> Subject: Loading file in libsvm format
> To: [email protected]
>
>
> Hi,
>
> I produced a file in libsvm format:
>
>     <label> <index1>:<value1> <index2>:<value2> ...
>
> with this content:
>
>     6284 576:1 884:1 2482:1 4279:1 5765:1 184552:1 661512:1 699842:1
>     2259 1669:1 5711528:6
>     2822 5765159:1
>     ...
>
> The label is document_id, and index:value are term_id and term count.
>
> This file has 83K labels with 40K unique terms (and overall 1.2M
> index:value pairs).
>
> When I load this file in sklearn:
>
>     from sklearn.datasets import load_svmlight_file
>     X, y = load_svmlight_file('libsim.txt')
>
> I get X with shape (82448, 6092168).
>
> I don't know of any reason why am I getting 6M features?
> Can someone explain?
>
>
> Thanks
>
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
