GitHub user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/730#issuecomment-112824195
  
    Thanks for your contribution @rbraeunlich et al. It looks very good. I have 
some minor comments, which you'll find in the code. While going through the code 
and looking at scikit-learn, I was wondering whether we should separate the 
IDF from the TF part and implement them as two `Transformer`s.
    
    The TF transformer could be something like scikit-learn's `CountVectorizer` 
or a `HashingTermFrequency` transformer. The `HashingTermFrequency` transformer 
could look like the code you've written for calculating the `dictionary`. This 
transformer would take care of generating the numeric vector representing the 
term frequencies.
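    
    To make the idea concrete, here is a minimal sketch of a hashing term-frequency step in plain Python. The function name `hashing_term_frequency`, the `num_features` parameter and the use of CRC32 as the hash are my assumptions for illustration only. This is not the code from the PR and not an existing Flink API.

```python
import zlib


def hashing_term_frequency(tokens, num_features=16):
    """Map a tokenized document to a fixed-length term-frequency vector.

    Hypothetical helper: each token is hashed into one of `num_features`
    buckets, so no dictionary has to be built from the corpus first.
    """
    vector = [0] * num_features
    for token in tokens:
        # CRC32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


doc = "the quick brown fox jumps over the lazy dog".split()
tf = hashing_term_frequency(doc)
```

    Different terms can collide in the same bucket, which is the usual trade-off of the hashing approach compared to an explicit dictionary.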
    
    The IDF transformer would then be trained on the numeric vectors generated 
from the corpus. Training means calculating the IDF values for the different 
terms. You can then process new documents by first transforming each document 
with the TF transformer (giving a vector of term frequencies) and then applying 
the IDF weighting.
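    
    The fit/transform split described above could be sketched like this in plain Python. The class name `IdfTransformer`, its method names and the smoothed formula `log((1 + n) / (1 + df)) + 1` are assumptions chosen for illustration. The PR may well use a different IDF definition.

```python
import math


class IdfTransformer:
    """Hypothetical IDF step that is trained on term-frequency vectors."""

    def fit(self, tf_vectors):
        if not tf_vectors:
            raise ValueError("need at least one term-frequency vector")
        n_docs = len(tf_vectors)
        n_features = len(tf_vectors[0])
        # Document frequency: in how many documents does feature j occur?
        doc_freq = [
            sum(1 for vector in tf_vectors if vector[j] > 0)
            for j in range(n_features)
        ]
        # Smoothed IDF (an assumed variant); the +1 terms avoid division by zero.
        self.idf_ = [
            math.log((1 + n_docs) / (1 + df)) + 1.0 for df in doc_freq
        ]
        return self

    def transform(self, tf_vector):
        # Weight each term frequency by the IDF value learned in fit().
        return [tf * idf for tf, idf in zip(tf_vector, self.idf_)]


# Term-frequency vectors as they would come out of the TF transformer.
corpus_tf = [[1, 0, 2], [0, 1, 1], [1, 0, 0]]
idf = IdfTransformer().fit(corpus_tf)
weighted = idf.transform([1, 1, 0])
```

    The IDF step never sees raw text here, only numeric vectors, so it can be chained behind any TF transformer in a pipeline.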
    
    What do you think?
    
    We also updated the pipelining mechanism. Thus, it would be great if you 
could rebase your PR on the latest master.


