GitHub user rbraeunlich opened a pull request:
https://github.com/apache/flink/pull/730
basic TfidfTransformer
Hi everybody,
for [FLINK-1999](https://issues.apache.org/jira/browse/FLINK-1999) we
created a first implementation of a TfIdfTransformer.
One problem remains: reducing the hash values with modulo causes
collisions, so different words can end up at the same vector index.
Nevertheless, we would be glad to receive some comments on our
implementation.
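To make the collision problem concrete, here is a minimal sketch. It is
not the code from this pull request: the names `bucket` and `collisions`
and the feature count of 16 are hypothetical, chosen only for
illustration, and it assumes the words are hashed with `String.hashCode`.

```scala
object HashingCollisionSketch {

  // Hypothetical helper: maps a term to a vector index in [0, numFeatures).
  // Assumes String.hashCode as the hash function.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    // % can return a negative value for negative hash codes, so shift it.
    if (raw < 0) raw + numFeatures else raw
  }

  // Groups distinct terms by their index and keeps only the indices
  // that are shared by more than one term, i.e. the collisions.
  def collisions(terms: Seq[String], numFeatures: Int): Map[Int, Seq[String]] =
    terms.distinct
      .groupBy(term => bucket(term, numFeatures))
      .filter { case (_, group) => group.size > 1 }

  def main(args: Array[String]): Unit = {
    // "a" and "q" have different hash codes (97 and 113),
    // but both leave remainder 1 when divided by 16.
    val terms = Seq("a", "q", "b")
    println(collisions(terms, 16))
    // With a larger feature space the same terms no longer collide.
    println(collisions(terms, 1024))
  }
}
```

A larger feature space makes such collisions less likely but cannot rule
them out, since the number of distinct words is not bounded in advance.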
Cheers,
Ronny
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rbraeunlich/flink tfidf
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/730.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #730
----
commit 9e9ac219b619ddfbab4f616165d038900b7726db
Author: Ronny Bräunlich <[email protected]>
Date: 2015-05-15T09:18:00Z
create TfIdfTransformer
commit 42ef7c00a832e21d7391e1011031bda162d930f1
Author: Ronny Bräunlich <[email protected]>
Date: 2015-05-16T14:38:28Z
fix import in TfIdfTranformer and add first basic test case
commit 82385b764f45f955cd88590b7657467689d096ed
Author: Ronny Bräunlich <[email protected]>
Date: 2015-05-15T09:18:00Z
create TfIdfTransformer and add first basic test case
commit 7242728b1c24027203f1ff91476de9acb9bbf3a7
Author: diva1012 <[email protected]>
Date: 2015-05-17T11:42:40Z
Changes merged
Merge remote-tracking branch 'rbraeunlich/tfidf' into tfidf
Conflicts:
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala
commit 9c2c181624bb81f3ed83a4a774339251508644f1
Author: diva1012 <[email protected]>
Date: 2015-05-17T17:40:00Z
Small fix of the test class. (The Sparse vector contains index -> value
tuples, so we have to take only the value and not the whole tuple for the
comparisson)
commit 8b17385e34b7f139a2649f80edc81744277fcfae
Author: diva1012 <[email protected]>
Date: 2015-05-18T06:41:58Z
Word count implementation simplified.
commit 229fac5f835ce05dd03544f7dd7c0df7952f18e9
Author: diva1012 <[email protected]>
Date: 2015-05-18T11:35:43Z
TF calculation fixed
commit e1ea4437e42860d8ed7820c32e08d7a2d1152b08
Author: diva1012 <[email protected]>
Date: 2015-05-19T20:44:31Z
Transformer improved: now we get SparseVector for each document that
contains all words.
----