[jira] [Commented] (FLINK-4964) FlinkML - Add StringIndexer

ASF GitHub Bot (JIRA) Tue, 15 Nov 2016 02:45:26 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15666809#comment-15666809
 ]


ASF GitHub Bot commented on FLINK-4964:
---------------------------------------

Github user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2740
  
    Hello @tfournier314, I tested your code and it does seem that partitions 
are sorted only internally, which is expected. AFAIK `zipWithIndex` is unaware 
of the sorted (as in key range) order of partitions, so it's not guaranteed 
that the "first" partition will get the indices `[0, partitionSize-1]`, the 
second `[partitionSize, 2*partitionSize-1]`, etc. Maybe @greghogan knows a 
solution for global sorting?
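    
    To make the problem concrete, here is a minimal plain-Scala sketch (no 
Flink involved) that mimics how per-partition offsets are handed out. The 
object name, the helper `zipWithIndexByPartition` and the sample 
`(label, count)` pairs are all made up for illustration; it only assumes that 
indices follow partition order rather than key order.
    
    ```scala
    object PartitionIndexSketch {
      // Mimics zipWithIndex: partition i starts at the total size of
      // partitions 0..i-1, regardless of which keys it holds.
      def zipWithIndexByPartition[T](partitions: Seq[Seq[T]]): Seq[(Long, T)] = {
        val offsets = partitions.scanLeft(0L)(_ + _.size)
        partitions.zip(offsets).flatMap { case (part, offset) =>
          part.zipWithIndex.map { case (elem, i) => (offset + i, elem) }
        }
      }

      def main(args: Array[String]): Unit = {
        // Hypothetical data: each partition is sorted by count (descending)
        // internally, but partition 0 does not hold the largest counts.
        val partitions = Seq(
          Seq(("c", 3), ("d", 1)),
          Seq(("a", 9), ("b", 5))
        )
        val indexed = zipWithIndexByPartition(partitions)
        indexed.foreach(println)
        // Counts read in index order are 3, 1, 9, 5: not globally descending.
        println(indexed.sortBy(_._1).map(_._2._2))
      }
    }
    ```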
    
    If it's not possible I think we can take a step back and see what we are 
trying to achieve here.
    
    The task is to count the frequencies of labels and assign integer ids to 
them in order of frequency. The labels should either be categorical variables 
(e.g. {Male, Female, Unknown}) or class labels. The case with the most unique 
values might be vocabulary words, which will be in the range of a few million 
unique values at worst.
    
    I would argue then that after we have performed the frequency count in a 
distributed manner, there is no need to also do the last step, assigning 
ordered indices, in a distributed manner. We can make the assumption that all 
the (label -> frequency) values fit into the memory of one machine.
    
    So I would recommend gathering all data into one partition after getting 
the counts; that way we guarantee a global ordering:
    
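    ```{Scala}
    import org.apache.flink.api.common.operators.Order
    import org.apache.flink.api.scala._
    import org.apache.flink.api.scala.utils._ // provides zipWithIndex

    fitData.map(s => (s, 1))
          .groupBy(0)                          // group by label
          .sum(1)                              // (label, frequency)
          .partitionByRange(x => 0)            // constant key: one partition
          .sortPartition(1, Order.DESCENDING)  // most frequent label first
          .zipWithIndex
          .print()
    ```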
    
    Of course we would need to clarify this restriction in the docstrings and 
documentation.
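    
    For illustration, the final step on a single machine could look like the 
following plain-Scala sketch, assuming the (label -> frequency) pairs have 
already been collected. The object and the `indexByFrequency` helper are 
hypothetical names, and breaking ties by label is an assumption that was not 
discussed above.
    
    ```scala
    object FrequencyIndexSketch {
      // Assigns ids 0, 1, 2, ... to labels in order of descending frequency.
      // Ties are broken by label so the result is deterministic (assumption).
      def indexByFrequency(counts: Seq[(String, Long)]): Map[String, Long] =
        counts
          .sortBy { case (label, count) => (-count, label) }
          .zipWithIndex
          .map { case ((label, _), idx) => label -> idx.toLong }
          .toMap

      def main(args: Array[String]): Unit = {
        // Hypothetical counts, as they might look after groupBy/sum.
        val counts = Seq(("Female", 4L), ("Male", 7L), ("Unknown", 1L))
        // Male gets id 0, Female id 1, Unknown id 2.
        indexByFrequency(counts).toSeq.sortBy(_._2).foreach(println)
      }
    }
    ```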


> FlinkML - Add StringIndexer
> ---------------------------
>
>                 Key: FLINK-4964
>                 URL: https://issues.apache.org/jira/browse/FLINK-4964
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Thomas FOURNIER
>            Priority: Minor
>
> Add StringIndexer as described here:
> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
> This will be added in package preprocessing of FlinkML



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
