2014-09-09 3:36 GMT+02:00 Apu Mishra <[email protected]>:
> Lars Buitinck <larsmans@...> writes:
>
>> The way to combine HV and
>> Tfidf is
>>
>> hashing = HashingVectorizer(non_negative=True, norm=None)
>> tfidf = TfidfTransformer()
>> hashing_tfidf = Pipeline([("hashing", hashing), ("tfidf", tfidf)])
>>
>
> I notice your use of the non_negative option in HashingVectorizer(), when
> following hashing with TF-IDF.
>
> Since using non_negative eliminates some information, I am curious whether
> there is any harm to allowing negative values as inputs to the TF-IDF
> function. In the general case, feature values whether positive or negative
> should simply scale up based on how document-infrequent they are, so I don't
> see the harm of allowing negative values.

non_negative=True is a hack, and yes, it throws away information, and
yes, I think we could define idf for negative values by computing the
document frequencies on the absolute values. It's just that no one has
implemented it. The first step would be to work out the repercussions:
if a feature is zero everywhere, it may still have been seen in the
input but cancelled out by the hasher's collision resolution, so the
df statistic is no longer reliable. Is that acceptable? Can we
honestly call the output of this hack tf-idf?
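To make the idea concrete, here is a minimal NumPy sketch of what "computing idf on the absolute values" could look like. This is an illustration, not scikit-learn's implementation; the function name `idf_on_signed` is made up, and I am assuming TfidfTransformer's smoothed idf formula, idf = log((1 + n) / (1 + df)) + 1. Note that df is unchanged by signs, since count_nonzero only asks whether an entry is nonzero:

```python
import numpy as np

def idf_on_signed(X):
    """Hypothetical sketch: idf-scale a signed (hashed) tf matrix.

    df is computed from nonzero entries, so the sign flips produced by
    the hasher do not affect it; the signed tf values are then scaled
    by idf, preserving their signs.
    """
    n_samples = X.shape[0]
    df = np.count_nonzero(X, axis=0)          # sign-insensitive document frequency
    idf = np.log((1 + n_samples) / (1 + df)) + 1  # smoothed idf, as in TfidfTransformer
    return X * idf                             # scale signed tf by idf

# Toy signed "hashed" tf matrix: 3 documents, 3 features.
X = np.array([[2., 0., -1.],
              [0., 1.,  3.],
              [1., 0.,  0.]])
Xt = idf_on_signed(X)
```

The sketch sidesteps the hack but not the caveat from above: a column that is zero everywhere because of collision cancellation still looks like df = 0, so the statistic remains unreliable in exactly the way described.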

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
