The POJO support should allow you to have a custom type with that many fields and then point to the relevant sort fields.
Unfortunately, the pojo expression keys are not available in group sorting as of today. The next version will solve this more elegantly...

On Tue, Oct 21, 2014 at 3:07 PM, Aljoscha Krettek <[email protected]> wrote:

> By the way, do you actually need all those 54 columns in your job?
>
> On Tue, Oct 21, 2014 at 3:02 PM, Martin Neumann <[email protected]> wrote:
>
> > I will go with that workaround, however I would have preferred to do it
> > directly with the API instead of building Map/Reduce-style Key/Value
> > tuples again :-)
> >
> > By the way, is there a simple function to count the number of items in
> > a reduce group? It feels stupid to write a GroupReduce that just
> > iterates and increments a counter.
> >
> > cheers Martin
> >
> > On Tue, Oct 21, 2014 at 2:54 PM, Robert Metzger <[email protected]> wrote:
> >
> >> Yes, for sorted groups you need to use Pojos or Tuples.
> >> I think you have to split the input lines manually, with a mapper.
> >> How about using a TupleN<...> with only the fields you need (returned
> >> by the mapper)?
> >>
> >> If you need all fields, you could also use a Tuple2<String, String[]>
> >> where the first position is the sort key?
> >>
> >> On Tue, Oct 21, 2014 at 2:20 PM, Gyula Fora <[email protected]> wrote:
> >>
> >> > I am not sure how you should go about that; let's wait for some
> >> > feedback from the others.
> >> >
> >> > Until then, you can always map the array to (array, keyfield) and
> >> > use groupBy(1).
> >> >
> >> > > On 21 Oct 2014, at 14:17, Martin Neumann <[email protected]> wrote:
> >> > >
> >> > > Hej,
> >> > >
> >> > > Unfortunately .sort() cannot take a key extractor, so would I
> >> > > have to do the sort myself then?
> >> > >
> >> > > cheers Martin
> >> > >
> >> > > On Tue, Oct 21, 2014 at 2:08 PM, Gyula Fora <[email protected]> wrote:
> >> > >
> >> > >> Hey,
> >> > >>
> >> > >> Using arrays is probably a convenient way to do so.
> >> > >>
> >> > >> I think the groupBy the way you described it only works for
> >> > >> tuples now. To do the grouping on the array field, you would
> >> > >> need to create a key extractor for it and pass that to groupBy.
> >> > >>
> >> > >> Actually, we have some use cases like this for streaming, so we
> >> > >> are thinking of writing a wrapper for the array types that would
> >> > >> behave as you described.
> >> > >>
> >> > >> Regards,
> >> > >> Gyula
> >> > >>
> >> > >>> On 21 Oct 2014, at 14:03, Martin Neumann <[email protected]> wrote:
> >> > >>>
> >> > >>> Hej,
> >> > >>>
> >> > >>> I have a csv file with 54 columns, each of them a string (for
> >> > >>> now). I need to group and sort them on field 15.
> >> > >>>
> >> > >>> What's the best way to load the data into Flink?
> >> > >>> There is no Tuple54 (and the <> would look awful anyway with 54
> >> > >>> times String in it).
> >> > >>> My current idea is to write a mapper and split the string into
> >> > >>> arrays of Strings. Would grouping and sorting work on that?
> >> > >>>
> >> > >>> So can I do something like this, or does that only work on
> >> > >>> tuples:
> >> > >>> DataSet<String[]> ds;
> >> > >>> ds.groupBy(15).sort(20, ANY)
> >> > >>>
> >> > >>> cheers Martin
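The (sort key, full row) workaround discussed in the thread can be sketched outside Flink. This is a plain-Java illustration of the shape only: in the real job the pair would be Flink's Tuple2<String, String[]> and the grouping/ordering would be groupBy(0) plus a group sort, while here ordinary collections stand in. The class and method names are invented for this example, and the toy data uses small column indices instead of 15 and 20.

```java
import java.util.*;

public class KeyedRows {

    // Mapper step: split one CSV line and pair it with its grouping column.
    static Map.Entry<String, String[]> toKeyedRow(String line, int keyField) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty columns
        return new AbstractMap.SimpleEntry<>(fields[keyField], fields);
    }

    // groupBy + group-sort step: bucket rows by key, then order each
    // bucket on a second column.
    static Map<String, List<String[]>> groupAndSort(List<String> lines,
                                                    int keyField, int sortField) {
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, String[]> row = toKeyedRow(line, keyField);
            groups.computeIfAbsent(row.getKey(), k -> new ArrayList<>())
                  .add(row.getValue());
        }
        for (List<String[]> group : groups.values()) {
            group.sort(Comparator.comparing((String[] fields) -> fields[sortField]));
        }
        return groups;
    }
}
```

The point of putting the key in the first position is that the grouping operator then only needs a field index, which is exactly what the tuple-based API expects.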
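On Martin's question about counting the items in a group: a common pattern in the tuple-based DataSet API of that time was to map every record to (key, 1) and then use groupBy(0).sum(1) instead of hand-writing a GroupReduce. Below is a minimal plain-Java sketch of that same logic; the class and method names are made up for illustration.

```java
import java.util.*;

public class GroupCounts {

    // Count group sizes by treating each record as (key, 1) and summing
    // per key, mirroring map -> groupBy(0) -> sum(1).
    static Map<String, Integer> countPerKey(List<String> keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum); // (key, 1), summed per key
        }
        return counts;
    }
}
```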
