Nope, but I can't filter out the useless data, since the program I'm comparing against doesn't filter it either. The point is to prove to one of my colleagues that Flink > Spark. The Spark program runs out of memory and crashes when doing just a simple group and count of the items.
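For reference, a minimal sketch of what such a group-and-count job could look like in the Java DataSet API. Untested; the input path is a placeholder, and the comma delimiter, the class name, and column 15 as the grouping field are assumptions taken from the thread below. Counting is done by mapping every record to (key, 1) and summing, which also sidesteps the hand-written GroupReduce discussed further down:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class GroupAndCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path; one comma-separated 54-column record per line.
        DataSet<String> lines = env.readTextFile("file:///path/to/input.csv");

        DataSet<Tuple2<String, Integer>> counts = lines
                // Keep only the grouping key (column 15) plus a count of 1.
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String line) {
                        return new Tuple2<String, Integer>(line.split(",")[15], 1);
                    }
                })
                .groupBy(0) // group on the extracted key
                .sum(1);    // count per group by summing the 1s

        counts.print(); // on the 0.7-era API, follow this with env.execute()
    }
}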
This is also one of the reasons I'm asking what the best style of doing this is, so I can get it as clean as possible to compare it to Spark.

cheers Martin

On Tue, Oct 21, 2014 at 3:07 PM, Aljoscha Krettek <aljos...@apache.org> wrote:

> By the way, do you actually need all those 54 columns in your job?
>
> On Tue, Oct 21, 2014 at 3:02 PM, Martin Neumann <mneum...@spotify.com> wrote:
>
> > I will go with that workaround. However, I would have preferred to do
> > that directly with the API instead of building Map/Reduce-like
> > key/value tuples again :-)
> >
> > By the way, is there a simple function to count the number of items in
> > a reduce group? It feels stupid to write a GroupReduce that just
> > iterates and increments a counter.
> >
> > cheers Martin
> >
> > On Tue, Oct 21, 2014 at 2:54 PM, Robert Metzger <rmetz...@apache.org> wrote:
> >
> >> Yes, for sorted groups you need to use POJOs or Tuples.
> >> I think you have to split the input lines manually, with a mapper.
> >> How about using a TupleN<...> with only the fields you need? (returned
> >> by the mapper)
> >>
> >> If you need all fields, you could also use a Tuple2<String, String[]>
> >> where the first position is the sort key?
> >>
> >> On Tue, Oct 21, 2014 at 2:20 PM, Gyula Fora <gyf...@apache.org> wrote:
> >>
> >> > I am not sure how you should go about that; let's wait for some
> >> > feedback from the others.
> >> >
> >> > Until then you can always map the array to (array, keyfield) and use
> >> > groupBy(1).
> >> >
> >> > > On 21 Oct 2014, at 14:17, Martin Neumann <mneum...@spotify.com> wrote:
> >> > >
> >> > > Hej,
> >> > >
> >> > > Unfortunately .sort() cannot take a key extractor. Would I have to
> >> > > do the sort myself then?
> >> > >
> >> > > cheers Martin
> >> > >
> >> > > On Tue, Oct 21, 2014 at 2:08 PM, Gyula Fora <gyf...@apache.org> wrote:
> >> > >
> >> > >> Hey,
> >> > >>
> >> > >> Using arrays is probably a convenient way to do so.
> >> > >>
> >> > >> I think the groupBy you described only works for tuples right now.
> >> > >> To do the grouping on the array field, you would need to create a
> >> > >> key extractor for it and pass that to groupBy.
> >> > >>
> >> > >> Actually, we have some use cases like this for streaming, so we
> >> > >> are thinking of writing a wrapper for the array types that would
> >> > >> behave as you described.
> >> > >>
> >> > >> Regards,
> >> > >> Gyula
> >> > >>
> >> > >>> On 21 Oct 2014, at 14:03, Martin Neumann <mneum...@spotify.com> wrote:
> >> > >>>
> >> > >>> Hej,
> >> > >>>
> >> > >>> I have a CSV file with 54 columns, each of them a string (for
> >> > >>> now). I need to group and sort them on field 15.
> >> > >>>
> >> > >>> What's the best way to load the data into Flink?
> >> > >>> There is no Tuple54 (and the <> would look awful anyway with 54
> >> > >>> times String in it).
> >> > >>> My current idea is to write a mapper and split the string into
> >> > >>> arrays of strings. Would grouping and sorting work on this?
> >> > >>>
> >> > >>> So, can I do something like this, or does that only work on tuples?
> >> > >>> DataSet<String[]> ds;
> >> > >>> ds.groupBy(15).sort(20, ANY)
> >> > >>>
> >> > >>> cheers Martin
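For the archives, here is one way the workaround suggested in this thread could look: Gyula's "map the array to (array, keyfield)" idea combined with Robert's hint about sorted groups, with the group key and sort key lifted into tuple positions so groupBy and sortGroup can address them. This is an untested sketch; the input path and delimiter are placeholders, Order.ASCENDING is a guess at the intended sort direction (the original mail used ANY), and the reduce only counts items, to match the Spark comparison job:

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;

public class GroupSortWorkaround {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path; one comma-separated 54-column record per line.
        DataSet<String> lines = env.readTextFile("file:///path/to/input.csv");

        // Lift the group key (column 15) and sort key (column 20) into
        // addressable tuple positions; the full record stays in f2.
        DataSet<Tuple3<String, String, String[]>> keyed = lines
                .map(new MapFunction<String, Tuple3<String, String, String[]>>() {
                    @Override
                    public Tuple3<String, String, String[]> map(String line) {
                        String[] fields = line.split(",");
                        return new Tuple3<String, String, String[]>(fields[15], fields[20], fields);
                    }
                });

        DataSet<Tuple2<String, Integer>> counts = keyed
                .groupBy(0)                    // group on column 15
                .sortGroup(1, Order.ASCENDING) // sort each group on column 20
                .reduceGroup(new GroupReduceFunction<Tuple3<String, String, String[]>, Tuple2<String, Integer>>() {
                    @Override
                    public void reduce(Iterable<Tuple3<String, String, String[]>> group,
                                       Collector<Tuple2<String, Integer>> out) {
                        // Walk the (sorted) group once; here we only count its items.
                        String key = null;
                        int count = 0;
                        for (Tuple3<String, String, String[]> record : group) {
                            key = record.f0;
                            count++;
                        }
                        out.collect(new Tuple2<String, Integer>(key, count));
                    }
                });

        counts.print(); // on the 0.7-era API, follow this with env.execute()
    }
}

If only the per-group count is needed and the within-group sort doesn't matter, the simpler map-to-(key, 1) plus sum(1) variant sketched at the top of this message avoids the GroupReduce entirely.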