Which version of Pig are you using? We've fixed many multi-query issues since the feature was released. Please try the queries again with the latest version and let us know of any problems.

Thanks,
-Richard
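For reference, multi-query means a script with several STORE statements is planned as one unit, so a shared input is read only once. A minimal sketch of the idea (paths, schema, and URL patterns below are hypothetical, just to show the shape):

    -- hypothetical input and schema, for illustration only
    logs     = LOAD '/logs/raw' AS (session:chararray, url:chararray, ts:long);
    searches = FILTER logs BY url MATCHES '.*search.*';
    views    = FILTER logs BY url MATCHES '.*product.*';
    STORE searches INTO '/out/search_logs';
    STORE views    INTO '/out/product_view_logs';
    -- with multi-query enabled (the default in recent releases), both
    -- STOREs execute in a single map-only job instead of two passes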
-----Original Message-----
From: Mridul Muralidharan [mailto:[email protected]]
Sent: Thursday, December 24, 2009 10:45 AM
To: [email protected]
Subject: Re: Some newbie questions

That is not exactly the same as what I proposed - not just from a
performance/implementation point of view, but probably also from a
code-reuse point of view. But then, I am not very familiar with the
multi-query work, because of the stability issues I face when using it -
so probably it is comparable! Of course, to each his own :-)

Regards,
Mridul

Richard Ding wrote:
> Pig supports queries that have multiple outputs (multi-query support).
>
> Thanks,
> -Richard
>
> -----Original Message-----
> From: Mridul Muralidharan [mailto:[email protected]]
> Sent: Thursday, December 24, 2009 10:10 AM
> To: [email protected]
> Subject: Re: Some newbie questions
>
> IIRC, Pig does not support multiple output collectors, does it?
> And the lack of a common schema in this case (each type has its own
> schema) is worrying.
>
> Regards,
> Mridul
>
> Richard Ding wrote:
>> Actually, you don't need to use Hadoop to create this map-only job;
>> Pig will do it for you.
>>
>> Thanks,
>> -Richard
>>
>> -----Original Message-----
>> From: Mridul Muralidharan [mailto:[email protected]]
>> Sent: Thursday, December 24, 2009 7:54 AM
>> To: [email protected]
>> Subject: Re: Some newbie questions
>>
>> If this is a one-time operation in your pipeline and you are OK with
>> splitting it, you might want to consider using Hadoop directly and
>> splitting based on a multiple-output collector.
>>
>> It could be a map-only job with a line record reader or similar, and a
>> map function that does the split as you were doing in the existing DB
>> code, writing to the appropriate output collector based on the type.
>>
>> All further analysis can then be done through Pig - which works on a
>> more type-specific, schema-aware form (assuming each type has a fixed
>> schema, while the initial jumble of types does not have a uniform
>> schema).
>>
>> Not sure if it is practical, since I have not used this for map-only
>> jobs ...
>>
>> Regards,
>> Mridul
>>
>> Gökhan Çapan wrote:
>>> Hi, this was probably discussed on this list before, but I couldn't
>>> find it.
>>> We are implementing log-analysis tools for some web sites that have
>>> high traffic. From now on, we want to use Pig to implement such
>>> analysis tools.
>>>
>>> We have millions of logs of a web site in a session-URL-time format.
>>> These are not just search logs or just product views; they consist of
>>> different types of actions. For example, if a URL contains a specific
>>> pattern, we call it a search log, etc.
>>>
>>> Until now, I was using a factory method to instantiate the appropriate
>>> URLHandler and, after extracting some information from the URL, storing
>>> that information in the appropriate database table. For example, if the
>>> program decides a URL is a search log, it extracts session, query, and
>>> time, corrects typos, determines an implicit rating, and stores these
>>> in the Search table (a relational database table). If the program
>>> decides a URL is a product-view log, it extracts session, member_id,
>>> product_id, time, product title, and a rating for the product, and
>>> stores them in the Product_View table. After the storing is finished,
>>> it extracts, for example, popular queries for search assistance.
>>>
>>> If I want to do all of this with Pig:
>>> - Should I partition the global log file into separate files
>>> (search_logs and product_view_logs in separate files)? Or
>>> - Can some Pig commands load the data and treat each tuple according
>>> to its type (e.g., "this is a search log, so it should have
>>> session-query-time-implicit rating"), so that I can avoid partitioning
>>> the data for each type of log?
>>>
>>> I have just downloaded Pig, and it seems able to do such tasks. I
>>> would appreciate it if anyone could show me a starting point for such
>>> an application and share some ideas.
>>> Thank you.
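The second option is what Pig's SPLIT operator covers: it routes each tuple to a branch by a condition in a single pass, so the raw file never has to be pre-partitioned. A minimal Pig Latin sketch, assuming a tab-delimited session/URL/timestamp input and made-up URL patterns (extracting the query string from the URL would need a regex UDF, e.g. from piggybank, and is left as a comment here):

    raw = LOAD '/logs/site' AS (session:chararray, url:chararray, ts:long);

    -- route each tuple by URL pattern in one pass
    SPLIT raw INTO
        search_raw IF url MATCHES '.*[?&]q=.*',
        view_raw   IF url MATCHES '.*/product/.*';
    -- tuples matching neither pattern are dropped

    -- each branch gets its own type-specific schema from here on;
    -- a regex UDF would pull the query string out of the URL
    searches = FOREACH search_raw GENERATE session, url, ts;
    STORE searches INTO '/out/search_logs';
    STORE view_raw INTO '/out/product_view_logs';

    -- downstream analysis, e.g. popular queries (grouping on url here
    -- as a stand-in for the extracted query)
    grouped = GROUP searches BY url;
    counts  = FOREACH grouped GENERATE group AS query, COUNT(searches) AS n;
    ordered = ORDER counts BY n DESC;
    top100  = LIMIT ordered 100;
    STORE top100 INTO '/out/popular_queries';

With multi-query execution, all of these STOREs run as one plan over a single read of the input, which is essentially the map-only split job discussed earlier in the thread.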
