By the way, do you actually need all those 54 columns in your job?

On Tue, Oct 21, 2014 at 3:02 PM, Martin Neumann <[email protected]> wrote:
> I will go with that workaround, however I would have preferred if I could
> have done that directly with the API instead of doing Map/Reduce like
> Key/Value tuples again :-)
>
> By the way, is there a simple function to count the number of items in a
> reduce group? It feels stupid to write a GroupReduce that just iterates
> and increments a counter.
>
> cheers Martin
>
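[Editor's note: not from the thread, but a common way to avoid a hand-written GroupReduce for counting is to map each record to (key, 1L) and use the built-in sum aggregation. A minimal sketch, assuming the Java DataSet API; the class name and toy data are illustrative:]

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class GroupCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Toy input; in the real job this would come from the CSV file
        DataSet<String> lines = env.fromElements("a,x", "a,y", "b,z");

        DataSet<Tuple2<String, Long>> counts = lines
            .map(new MapFunction<String, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> map(String line) {
                    // key is the first CSV field; each record contributes 1
                    return new Tuple2<String, Long>(line.split(",")[0], 1L);
                }
            })
            .groupBy(0)  // group on the key field
            .sum(1);     // built-in aggregation replaces the manual counter

        counts.print();  // on older Flink versions an explicit env.execute() may be needed
    }
}
```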
> On Tue, Oct 21, 2014 at 2:54 PM, Robert Metzger <[email protected]> wrote:
>
>> Yes, for sorted groups, you need to use Pojos or Tuples.
>> I think you have to split the input lines manually, with a mapper.
>> How about using a TupleN<...> with only the fields you need? (returned by
>> the mapper)
>>
>> If you need all fields, you could also use a Tuple2<String, String[]> where
>> the first position is the sort key?
>>
>>
>>
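[Editor's note: Robert's Tuple2<String, String[]> idea could be sketched as below. I extend it to a Tuple3 (my assumption, not stated in the thread) so that the sort field from the original question (field 20) is also available to sortGroup; the class name and input path are placeholders:]

```java
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;

public class GroupAndSort {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // "input.csv" is a placeholder; each line has 54 comma-separated fields
        DataSet<Tuple3<String, String, String[]>> keyed = env.readTextFile("input.csv")
            .map(new MapFunction<String, Tuple3<String, String, String[]>>() {
                @Override
                public Tuple3<String, String, String[]> map(String line) {
                    String[] fields = line.split(",");
                    // (group key, sort key, all fields)
                    return new Tuple3<String, String, String[]>(fields[15], fields[20], fields);
                }
            });

        keyed.groupBy(0)                     // group on field 15
             .sortGroup(1, Order.ASCENDING)  // sort within each group on field 20
             .reduceGroup(new GroupReduceFunction<Tuple3<String, String, String[]>, String>() {
                 @Override
                 public void reduce(Iterable<Tuple3<String, String, String[]>> group,
                                    Collector<String> out) {
                     for (Tuple3<String, String, String[]> t : group) {
                         out.collect(t.f0 + " -> " + t.f1);
                     }
                 }
             })
             .print();
    }
}
```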
>> On Tue, Oct 21, 2014 at 2:20 PM, Gyula Fora <[email protected]> wrote:
>>
>> > I am not sure how you should go about that, let’s wait for some feedback
>> > from the others.
>> >
>> > Until then you can always map the array to (array, keyfield) and use
>> > groupBy(1).
>> >
>> >
>> > > On 21 Oct 2014, at 14:17, Martin Neumann <[email protected]> wrote:
>> > >
>> > > Hej,
>> > >
>> > > Unfortunately .sort() cannot take a key extractor; would I have to
>> > > do the sort myself then?
>> > >
>> > > cheers Martin
>> > >
>> > > On Tue, Oct 21, 2014 at 2:08 PM, Gyula Fora <[email protected]> wrote:
>> > >
>> > >> Hey,
>> > >>
>> > >> Using arrays is probably a convenient way to do so.
>> > >>
>> > >> I think groupBy, the way you described it, only works for tuples
>> > >> right now. To do the grouping on the array field, you would need
>> > >> to create a key extractor for this and pass that to groupBy.
>> > >>
>> > >> Actually, we have some use cases like this for streaming, so we
>> > >> are thinking of writing a wrapper for the array types that would
>> > >> behave as you described.
>> > >>
>> > >> Regards,
>> > >> Gyula
>> > >>
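[Editor's note: Gyula's key-extractor suggestion could look like the sketch below, assuming the Java DataSet API. Since .sort() could not take an extractor at the time (as noted later in the thread), the sort on field 20 is done manually inside the GroupReduce; names and the input path are illustrative:]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.util.Collector;

public class GroupByExtractor {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // "input.csv" is a placeholder; each line has 54 comma-separated fields
        DataSet<String[]> rows = env.readTextFile("input.csv")
            .map(new MapFunction<String, String[]>() {
                @Override
                public String[] map(String line) {
                    return line.split(",");
                }
            });

        rows.groupBy(new KeySelector<String[], String>() {
                @Override
                public String getKey(String[] fields) {
                    return fields[15];  // group key
                }
            })
            .reduceGroup(new GroupReduceFunction<String[], String>() {
                @Override
                public void reduce(Iterable<String[]> group, Collector<String> out) {
                    // Buffer the group and sort manually on field 20.
                    // Note: this holds the whole group in memory.
                    List<String[]> buffer = new ArrayList<String[]>();
                    for (String[] row : group) {
                        buffer.add(row);
                    }
                    Collections.sort(buffer, new Comparator<String[]>() {
                        @Override
                        public int compare(String[] a, String[] b) {
                            return a[20].compareTo(b[20]);
                        }
                    });
                    for (String[] row : buffer) {
                        out.collect(row[15] + "," + row[20]);
                    }
                }
            })
            .print();
    }
}
```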
>> > >>> On 21 Oct 2014, at 14:03, Martin Neumann <[email protected]>
>> wrote:
>> > >>>
>> > >>> Hej,
>> > >>>
>> > >>> I have a CSV file with 54 columns, each of them a string (for
>> > >>> now). I need to group and sort them on field 15.
>> > >>>
>> > >>> What's the best way to load the data into Flink? There is no
>> > >>> Tuple54 (and the <> would look awful anyway with String in it 54
>> > >>> times). My current idea is to write a mapper and split the string
>> > >>> into an array of Strings. Would grouping and sorting work on that?
>> > >>>
>> > >>> So, can I do something like this, or does that only work on tuples?
>> > >>> DataSet<String[]> ds;
>> > >>> ds.groupBy(15).sort(20, Order.ANY)
>> > >>>
>> > >>> cheers Martin
>> > >>
>> > >>
>> >
>> >
>>
