[ https://issues.apache.org/jira/browse/FLINK-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585871#comment-14585871 ]
ASF GitHub Bot commented on FLINK-2152: --------------------------------------- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/832#discussion_r32415490 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java --- @@ -0,0 +1,106 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.api.java.utils; + +import org.apache.flink.api.common.functions.RichMapPartitionFunction; +import org.apache.flink.api.java.DataSet; +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.util.Collector; + +import java.util.Collections; +import java.util.Comparator; +import java.util.List; + +/** + * This class provides simple utility methods for zipping elements in a file with an index. + * + * @param <T> The type of the DataSet, i.e., the type of the elements of the DataSet. + */ +public class DataSetUtils<T> { + + /** + * Method that goes over all the elements in each partition in order to retireve + * the total number of elements. + * + * @param input the DataSet received as input + * @return a data set containing tuples of subtask index, number of elements mappings. + */ + public DataSet<Tuple2<Integer, Long>> countElements(DataSet<T> input) { + return input.mapPartition(new RichMapPartitionFunction<T, Tuple2<Integer,Long>>() { + @Override + public void mapPartition(Iterable<T> values, Collector<Tuple2<Integer, Long>> out) throws Exception { + long counter = 0; + for(T value: values) { + counter ++; + } + + out.collect(new Tuple2<Integer, Long>(getRuntimeContext().getIndexOfThisSubtask(), counter)); + } + }); + } + + /** + * Method that takes a set of subtask index, total number of elements mappings + * and assigns ids to all the elements from the input data set. + * + * @param input the input data set + * @return a data set of tuple 2 consisting of consecutive ids and initial values. + */ + public DataSet<Tuple2<Long, T>> zipWithIndex(DataSet<T> input) { --- End diff -- But you still would have to import the implicit cast for the pimp my library class to make it usable. Thus it is only syntactic sugar for writing something like `DataSetUtils.zipWithIndex(ds)`. > Provide zipWithIndex utility in flink-contrib > --------------------------------------------- > > Key: FLINK-2152 > URL: https://issues.apache.org/jira/browse/FLINK-2152 > Project: Flink > Issue Type: Improvement > Components: Java API > Reporter: Robert Metzger > Assignee: Andra Lungu > Priority: Trivial > Labels: starter > > We should provide a simple utility method for zipping elements in a data set > with a dense index. > its up for discussion whether we want it directly in the API or if we should > provide it only as a utility from {{flink-contrib}}. > I would put it in {{flink-contrib}}. > See my answer on SO: > http://stackoverflow.com/questions/30596556/zipwithindex-on-apache-flink -- This message was sent by Atlassian JIRA (v6.3.4#6332)