Re: Any better way to ensure unicity ?

hc busy Fri, 16 Jul 2010 14:25:49 -0700

Hmmm.... I didn't know you could go crazy like that. I take back what I said
about Pig not being concise or elegant.


Oh, there isn't a way to extend the language. I mean, unless while I wasn't
looking those additional "#define" syntaxes have been implemented already.
And I still think recursive functions are a must-have.



On Thu, Jul 15, 2010 at 8:32 PM, Mridul Muralidharan
<[email protected]> wrote:

>
> It is more about maintaining yet another udf which duplicates functionality
> which is done by the base language ...
> So tradeoff is between using a language construct (which might be optimized
> internally) versus writing extension code.
>
> Mridul
>
>
> On Friday 16 July 2010 04:03 AM, hc busy wrote:
>
>> LIMIT is an extra line to type. But I guess if we're using pig, we don't
>> really care for elegance and concision huh?
>>
>>
>> On Wed, Jul 14, 2010 at 12:25 PM, Dmitriy Ryaboy<[email protected]>
>>  wrote:
>>
>>> hc, two things about that approach:
>>>
>>> 1) if you use the accumulator interface, the bag won't be materialized
>>> 2) am I missing something? Why can't you just use LIMIT 1?
>>>
>>> -D
>>>
>>> On Wed, Jul 14, 2010 at 10:39 AM, hc busy<[email protected]>  wrote:
>>>
>>>> Write a UDF called
>>>>
>>>> takeOne()
>>>>
>>>> that takes the first thing from the bag and returns it. The only problem
>>>> that I'm having is that this UDF cannot signal to pig that it is done, so
>>>> the whole bag is always created in its entirety.
>>>>
>>>>
>>>> Btw, this UDF will be able to accomplish the same task (picking one item
>>>> out of a bag)
>>>>
>>>> https://issues.apache.org/jira/browse/PIG-1386
>>>>
>>>> because MaxTupleByNthField extends the original MaxTupleBy1stField by
>>>> allowing you to specify any column in the tuple as the comparison key. And
>>>> because it handles typing correctly, your schema will be as you expect
>>>> automatically.
>>>>
>>>> sessions = GROUP sessions BY sid;
>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>> FLATTEN(first);};
>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in
>>>> the session table...
>>>>
>>>>
>>>> is replaced with
>>>>
>>>> session = GROUP session by sid;
>>>> session = FOREACH session generate MaxTupleByNthField(session);
>>>>
>>>> that's it. it'll have the right schema, all columns from before, but chooses
>>>> one of the data points.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey<[email protected]> wrote:
>>>>
>>>>> I run into this situation all the time.  You have to do a foreach ...
>>>>> generate projection at the end to rename everything.
>>>>>
>>>>> The way aliases work in pig, you quite often have to do 'renaming only'
>>>>> projections if you don't want to make other bits of code later change:
>>>>> After the group and limit:
>>>>>
>>>>> sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2,
>>>>> field3 as field3 . . .
>>>>>
>>>>> That will get rid of the :: prefixes and make the alias shareable with
>>>>> later pig code and not dependent on what you do in the group to filter
>>>>> data.
>>>>>
>>>>>
>>>>> On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
>>>>>
>>>>>> Actually you are right: the schema is the same; nevertheless, the
>>>>>> "naming" of the various columns in the schema is modified, and thus
>>>>>> my subsequent operations fail:
>>>>>>
>>>>>> original schema:
>>>>>> sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
>>>>>>
>>>>>> modified schema:
>>>>>> sessions: {first::sid: chararray,first::infoid:
>>>>>> chararray,first::imei: chararray,first::start: long}
>>>>>>
>>>>>> Do you know a workaround ?
>>>>>>
>>>>>> On 13/07/10 10:13, Mridul Muralidharan wrote:
>>>>>>
>>>>>>>
>>>>>>> The flatten will return the same schema as before (in 'first') :
>>>>>>> so unless you are modifying the fields or the order in which they
>>>>>>> are generated (which I don't think you are in view of your comment
>>>>>>> that it should work with and without this), you can simply go with :
>>>>>>>
>>>>>>> -- Or whatever works for you.
>>>>>>> %define PARALLELISM        '10'
>>>>>>>
>>>>>>> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>>>>>>>
>>>>>>> OR
>>>>>>>
>>>>>>> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
>>>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>>>>> FLATTEN(first);};
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The schema at the end would be exactly the same as at the start of
>>>>>>> the code snippet for 'sessions'.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/07/10 16:56, Mridul Muralidharan wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not sure what you mean here exactly.
>>>>>>>>> Will a sid row have multiple (different) values for the other
>>>>>>>>> fields ?
>>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> But if you want to pick any one row for a given sid, then I think
>>>>>>>>> what you have below might be good enough (you can omit the last
>>>>>>>>> line though).
>>>>>>>>>
>>>>>>>> OK. Thanks. The last line is used to retrieve the exact same data
>>>>>>>> structure and naming as the original table. This way, I can
>>>>>>>> optionally perform this treatment without modifying my code. If you
>>>>>>>> know a better way...
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mridul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>>>>>>>
>>>>>>>>>> Hello everybody,
>>>>>>>>>>
>>>>>>>>>> I have a simple table containing sessions. Each session has a
>>>>>>>>>> unique key (the sid, which is actually a uuid).
>>>>>>>>>> But a session can be present several times in my input table.
>>>>>>>>>>
>>>>>>>>>> I want to ensure that I only have 1 record for each sid (because I
>>>>>>>>>> perform subsequent JOIN based on this sid).
>>>>>>>>>>
>>>>>>>>>> Currently I use the following script, but I wonder if there is
>>>>>>>>>> something more efficient:
>>>>>>>>>>
>>>>>>>>>> sessions = GROUP sessions BY sid;
>>>>>>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>>>>>>>>>> FLATTEN(first);};
>>>>>>>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
>>>>>>>>>> have in the session table...
>>>>>>>>>>
>>>>>>>>>> Do you see any optimization I can do, especially on the FLATTEN /
>>>>>>>>>> GENERATE part ?
>>>>>>>>>>
>>>>>>>>>> Thank you very much for your help.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>
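
For reference, the approach the thread converges on (GROUP, a nested LIMIT 1,
then a renaming projection) might be written as below. This is only a minimal
sketch, assuming the sessions schema quoted earlier in the thread (sid, infoid,
imei, start); the intermediate relation names by_sid and one_per_sid are
illustrative and do not appear in the original messages.

```pig
-- Keep one arbitrary record per sid, then restore the original column names.
-- Assumes sessions: {sid: chararray, infoid: chararray, imei: chararray, start: long}
by_sid = GROUP sessions BY sid;

-- The nested LIMIT keeps a single tuple from each group's bag.
one_per_sid = FOREACH by_sid {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};

-- Explicit AS clauses drop the first:: prefixes, so later code
-- can keep using the original field names.
sessions = FOREACH one_per_sid GENERATE
    first::sid    AS sid,
    first::infoid AS infoid,
    first::imei   AS imei,
    first::start  AS start;
```

Which record survives for a given sid is not specified here; if a particular
one is wanted, an ORDER inside the nested block before the LIMIT is one way to
choose it.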
