2014-09-09 3:36 GMT+02:00 Apu Mishra <[email protected]>:
> Lars Buitinck <larsmans@...> writes:
>
>> The way to combine HV and Tfidf is
>>
>> hashing = HashingVectorizer(non_negative=True, norm=None)
>> tfidf = TfidfTransformer()
>> hashing_tfidf = Pipeline([("hashing", hashing), ("tfidf", tfidf)])
>
> I notice your use of the non_negative option in HashingVectorizer() when
> following hashing with TF-IDF.
>
> Since using non_negative eliminates some information, I am curious whether
> there is any harm in allowing negative values as inputs to the TF-IDF
> function. In the general case, feature values, whether positive or negative,
> should simply scale up based on how document-infrequent they are, so I don't
> see the harm of allowing negative values.
non_negative=True is a hack, and yes, it throws away information, and yes, I
think we could define it for negative values by computing idf on the absolute
values. It's just that no one has done so.

The first step would be to work out the repercussions: if a feature has zero
value everywhere, its terms might still have been seen, but thrown away by the
hasher's collision resolving, so the df statistic is no longer reliable. Is
that acceptable? Can we honestly call the output of this hack tf-idf?

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
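To make the collision concern concrete, here is a small pure-Python sketch of the signed hashing trick. Everything here is illustrative: md5 stands in for the MurmurHash3 that scikit-learn actually uses, and the bucket count, sign rule, and term names are made up. It shows how two terms can land in the same bucket with opposite signs and cancel to zero, so a df computed on the hashed vectors would never count that bucket even though both terms occur in the document.

```python
import hashlib

N_FEATURES = 4  # tiny feature space so collisions are easy to find


def signed_hash(term):
    # Deterministic stand-in for MurmurHash3: the low bits pick the
    # bucket, a higher bit picks the sign.
    h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16)
    return h % N_FEATURES, 1 if (h >> 16) & 1 == 0 else -1


def hash_vectorize(doc, non_negative=False):
    # Sum signed term counts into a fixed-size vector (the hashing trick).
    vec = [0] * N_FEATURES
    for term in doc.split():
        i, s = signed_hash(term)
        vec[i] += s
    if non_negative:
        # The "hack": take absolute values so downstream tf-idf sees
        # non-negative counts -- the sign information is discarded.
        vec = [abs(v) for v in vec]
    return vec


# Find two distinct terms that share a bucket with opposite signs.  Each
# occurs once in the document, yet their contributions cancel to zero.
terms = ["t%d" % k for k in range(200)]
a, b = next(
    (x, y)
    for x in terms
    for y in terms
    if x != y
    and signed_hash(x)[0] == signed_hash(y)[0]
    and signed_hash(x)[1] != signed_hash(y)[1]
)
doc = a + " " + b
bucket = signed_hash(a)[0]
print(hash_vectorize(doc)[bucket])  # prints 0: both terms seen, bucket empty
```

Taking absolute values (non_negative=True) cannot undo this cancellation: the information is already lost by the time the abs is applied, which is exactly why df statistics computed downstream can't be fully trusted.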
