[
https://issues.apache.org/jira/browse/FLINK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15666809#comment-15666809
]
ASF GitHub Bot commented on FLINK-4964:
---------------------------------------
Github user thvasilo commented on the issue:
https://github.com/apache/flink/pull/2740
Hello @tfournier314 I tested your code and it does seem that partitions are
sorted
only internally, which is expected and `zipWithIndex` is AFAIK unaware of
the sorted (as in key range) order of partitions, so it's not guaranteed that
the "first" partition will get the `[0, partitionSize-1]` indices, the second
`[partitionSize, 2*partitionSize]` etc. Maybe @greghogan knows a solution for
global sorting?
If it's not possible I think we can take a step back and see what we are
trying to achieve here.
The task is to count the frequencies of labels and assign integer ids to
them in order of frequency. The labels should either be categorical variables
(e.g. {Male, Female, Uknown}) or class labels. The case with the most unique
values might be vocabulary words, which will range in the few million unique
values at worst.
I would argue then than after we have performed the frequency count in a
distributed manner there is no need to do the last step which is assigning
ordered indices in a distributed manner as well, we can make the assumption
that all the (label -> frequency) values should fit into the memory of one
machine.
So I would recommend to gather all data into one partition after getting
the counts, that way we guarantee a global ordering:
```{Scala}
fitData.map(s => (s,1))
.groupBy(0)
.sum(1)
.partitionByRange(x => 0)
.sortPartition(1, Order.DESCENDING)
.zipWithIndex
.print()
```
Of course we would need to clarify this restriction in the docstrings and
documentation.
> FlinkML - Add StringIndexer
> ---------------------------
>
> Key: FLINK-4964
> URL: https://issues.apache.org/jira/browse/FLINK-4964
> Project: Flink
> Issue Type: New Feature
> Reporter: Thomas FOURNIER
> Priority: Minor
>
> Add StringIndexer as described here:
> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
> This will be added in package preprocessing of FlinkML
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)