Github user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/730#issuecomment-112824195
Thanks for your contribution @rbraeunlich et al. It looks very good. I left
some minor comments inline in the code. While going through the code
and looking at scikit-learn, I was wondering whether we should separate the
IDF part from the TF part and implement them as two `Transformer`s.
The TF transformer could be something like scikit-learn's `CountVectorizer`
or a HashingTermFrequency transformer. The HashingTermFrequency transformer
could look like the code you've written for calculating the `dictionary`. This
transformer would take care of generating the numeric vector representing the
term frequencies.
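
To make the hashing idea concrete, here is a rough sketch in plain Python rather than against the Flink ML API. The function name `hashing_tf`, the `num_features` parameter and the use of CRC32 as the hash function are assumptions chosen for illustration, not what the PR does or has to do.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=1024):
    """Map a tokenized document to a sparse term-frequency vector.

    Hypothetical sketch: each token is hashed to an index in
    [0, num_features), so no dictionary has to be built or stored.
    Distinct tokens may collide on the same index.
    """
    counts = Counter()
    for token in tokens:
        # CRC32 is used only because it is stable across processes;
        # any deterministic hash function would do (assumption).
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    # Sparse representation: {feature index: term count}
    return dict(counts)
```

The trade-off compared to a dictionary-based `CountVectorizer` is that the vector size is fixed up front and no fitting step is needed, at the cost of possible hash collisions.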
The IDF transformer is then trained on the numeric vectors generated from
the corpus. Training means calculating the IDF values for the different
terms. You can then process new documents by first transforming each document
with the TF transformer (giving a vector of term frequencies) and then applying
the IDF weighting.
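
As a rough sketch of that second stage, again in plain Python and not against the Flink ML API: the class name `IdfTransformer`, the sparse `{index: count}` vector format, the plain `log(N / df)` formula and the zero weight for unseen indices are all assumptions made for illustration.

```python
import math


class IdfTransformer:
    """Hypothetical IDF stage: fit on TF vectors, then reweight TF vectors."""

    def __init__(self):
        self.idf = {}
        self.num_docs = 0

    def fit(self, tf_vectors):
        tf_vectors = list(tf_vectors)
        # Document frequency: in how many documents each index occurs.
        doc_freq = {}
        for vec in tf_vectors:
            for index, count in vec.items():
                if count > 0:
                    doc_freq[index] = doc_freq.get(index, 0) + 1
        self.num_docs = len(tf_vectors)
        # Unsmoothed idf = log(N / df); the exact formula is an assumption.
        self.idf = {
            index: math.log(self.num_docs / df)
            for index, df in doc_freq.items()
        }
        return self

    def transform(self, tf_vector):
        # Indices not seen during fit get weight 0.0 here (an assumption).
        return {
            index: tf * self.idf.get(index, 0.0)
            for index, tf in tf_vector.items()
        }
```

Usage would be to call `fit` once on the TF vectors of the whole corpus and then `transform` on the TF vector of each new document, which keeps the IDF stage independent of how the term frequencies were produced.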
What do you think?
We also updated the pipelining mechanism. Thus, it would be great if you
could rebase your PR on the latest master.