That is not exactly the same as what I proposed - not just from a
performance/implementation point of view, but probably also from a code
reuse point of view. But then, I am not very familiar with the multi-query
work because of the stability issues I face when using it - so
it is probably comparable!
Of course, to each his own :-)
Regards,
Mridul
Richard Ding wrote:
Pig supports queries that have multiple outputs (multi-query support).
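To sketch what that looks like (a hypothetical example - the load path, field names, and URL patterns below are made-up placeholders, not taken from your data): a single script can SPLIT one relation by URL pattern and STORE each branch, and multi-query execution will run the shared LOAD and SPLIT in a single pass over the input:

```pig
-- Hypothetical sketch: paths, schema, and patterns are illustrative only.
raw = LOAD 'logs/access_log' AS (session:chararray, url:chararray, time:long);

SPLIT raw INTO
    searches IF url MATCHES '.*[?&]q=.*',
    product_views IF url MATCHES '.*/product/.*';

-- With multi-query execution, both STOREs share one scan of the input.
STORE searches INTO 'out/search_logs';
STORE product_views INTO 'out/product_view_logs';
```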
Thanks,
-Richard
-----Original Message-----
From: Mridul Muralidharan [mailto:[email protected]]
Sent: Thursday, December 24, 2009 10:10 AM
To: [email protected]
Subject: Re: Some newbie questions
IIRC, Pig does not support multiple output collectors, does it?
And the lack of a common schema in this case (each type has its own
schema) is worrying.
Regards,
Mridul
Richard Ding wrote:
Actually, you don't need to use Hadoop to create this map-only job; Pig will do it for you.
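As a rough illustration (with hypothetical paths and field names): a script that only loads, filters, and stores compiles down to exactly such a map-only job, with no hand-written Hadoop code:

```pig
-- Hypothetical sketch: a LOAD + FILTER + STORE pipeline like this is
-- planned by Pig as a single map-only MapReduce job.
raw = LOAD 'logs/access_log' AS (session:chararray, url:chararray, time:long);
searches = FILTER raw BY url MATCHES '.*[?&]q=.*';
STORE searches INTO 'out/search_logs';
```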
Thanks,
-Richard
-----Original Message-----
From: Mridul Muralidharan [mailto:[email protected]]
Sent: Thursday, December 24, 2009 7:54 AM
To: [email protected]
Subject: Re: Some newbie questions
If this is a one-time operation in your pipeline and you are OK with
splitting it, you might want to consider using Hadoop directly and
splitting based on a multiple-output collector.
It can be a map-only job with a line record reader or similar, a map
function which does the split as you were doing in the existing DB code,
and which writes to the appropriate output collector based on the type.
All further analysis can then be done through Pig, which works on a more
type-specific, schema-aware form (assuming each type has a fixed schema,
while the initial jumble of types does not have a uniform schema).
I am not sure if it is practical, since I have not used this for map-only jobs ...
Regards,
Mridul
Gökhan Çapan wrote:
Hi, this was probably discussed on this list before, but I couldn't find it.
We are implementing log analysis tools for some high-traffic web sites.
Going forward, we want to use Pig to implement these analysis tools.
We have millions of logs of a web site in a session-URL-time format.
These are not just search logs or just product views; the data consists of
different types of actions.
For example, if a URL contains a specific pattern, we call it a search log,
and so on.
Until now, I was using a factory method to instantiate the appropriate
URLHandler, and after extracting some information from the URL, I was
storing that information in the appropriate database table. For example, if
the program decides a URL is a search log, it extracts the session, query,
and time, corrects typos, determines the implicit rating, goes to the Search
table (a relational database table), and stores these in the table. If the
program decides a URL is a product view log, it extracts the session,
member_id, product_id, time, product title, and rating for the product, goes
to the Product_View table, and stores them. After the storing is finished,
it extracts, for example, popular queries for search assistance.
If I want to do all of this with Pig:
- Should I partition the global log file into separate files (search_logs
and product_view_logs in separate files)? Or
- Can some Pig commands load the data and treat each tuple according to its
type (e.g., this is a search log and it should have
"session-query-time-implicit rating"), so that I can avoid partitioning the
data for each type of log?
I have just downloaded Pig, and it seems to be able to do such tasks. I
would appreciate it if anyone could show me a starting point for such an
application and share some ideas.
Thank you.