That is not exactly the same as what I proposed - not just from a
performance/implementation point of view, but probably also from a code
reuse point of view. But then, I am not very familiar with the multi-query
work because of the stability issues I face when using it - so
it is probably comparable!
Of course, to each his own :-)
Regards,
Mridul
Richard Ding wrote:
Pig supports queries that have multiple outputs (multi-query support).
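To sketch what that looks like (a hypothetical example - the load path, field names, and URL patterns below are made-up placeholders, not taken from your data): a single script can SPLIT one relation by URL pattern and STORE each branch, and multi-query execution will run the shared LOAD and SPLIT in a single pass over the input:

```pig
-- Hypothetical sketch: paths, schema, and patterns are illustrative only.
raw = LOAD 'logs/access_log' AS (session:chararray, url:chararray, time:long);

SPLIT raw INTO
    searches IF url MATCHES '.*[?&]q=.*',
    product_views IF url MATCHES '.*/product/.*';

-- With multi-query execution, both STOREs share one scan of the input.
STORE searches INTO 'out/search_logs';
STORE product_views INTO 'out/product_view_logs';
```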
Thanks,
-Richard
-----Original Message-----
From: Mridul Muralidharan [mailto:[email protected]]
Sent: Thursday, December 24, 2009 10:10 AM
To: [email protected]
Subject: Re: Some newbie questions
IIRC, Pig does not support multiple output collectors, does it?
And the lack of a common schema in this case (each type has its own
schema) is worrying.
Regards,
Mridul
Richard Ding wrote:
Actually, you don't need to use Hadoop to create this map-only job; Pig will do it for you.
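As a rough illustration (with hypothetical paths and field names): a script that only loads, filters, and stores compiles down to exactly such a map-only job, with no hand-written Hadoop code:

```pig
-- Hypothetical sketch: a LOAD + FILTER + STORE pipeline like this is
-- planned by Pig as a single map-only MapReduce job.
raw = LOAD 'logs/access_log' AS (session:chararray, url:chararray, time:long);
searches = FILTER raw BY url MATCHES '.*[?&]q=.*';
STORE searches INTO 'out/search_logs';
```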
Thanks,
-Richard
-----Original Message-----
From: Mridul Muralidharan [mailto:[email protected]]
Sent: Thursday, December 24, 2009 7:54 AM
To: [email protected]
Subject: Re: Some newbie questions
If this is a one-time operation in your pipeline and you are OK with
splitting it, you might want to consider using Hadoop directly and
splitting based on a multiple-output collector.
It can be a map-only job with a line record reader or similar, a map
function which does the split as you were doing in the existing DB code,
and which writes to the appropriate output collector based on the type.
All further analysis can then be done through Pig, which works on a more
type-specific, schema-aware form (assuming each type has a fixed schema,
while the initial jumble of types does not have a uniform schema).
I am not sure if it is practical, since I have not used this for map-only jobs ...
Regards,
Mridul
Gökhan Çapan wrote:
Hi, this was probably discussed on this list before, but I couldn't find it.
We are implementing log analysis tools for some high-traffic web sites.
Going forward, we want to use Pig to implement these analysis tools.
We have millions of logs of a web site in a session-URL-time format.
These are not just search logs or just product views; the data consists of
different types of actions.
For example, if a URL contains a specific pattern, we call it a search log,
and so on.
Until now, I was using a factory method to instantiate the appropriate
URLHandler, and after extracting some information from the URL, I was
storing that information in the appropriate database table. For example, if
the program decides a URL is a search log, it extracts the session, query,
and time, corrects typos, determines the implicit rating, goes to the Search
table (a relational database table), and stores these in the table. If the
program decides a URL is a product view log, it extracts the session,
member_id, product_id, time, product title, and rating for the product, goes
to the Product_View table, and stores them. After the storing is finished,
it extracts, for example, popular queries for search assistance.
If I want to do all of this with Pig:
- Should I partition the global log file into separate files (search_logs
and product_view_logs in separate files)? Or
- Can some Pig commands load the data and treat each tuple according to its
type (e.g., this is a search log and it should have
"session-query-time-implicit rating"), so that I can avoid partitioning the
data for each type of log?
I have just downloaded Pig, and it seems to be able to do such tasks. I
would appreciate it if anyone could show me a starting point for such an
application and share some ideas.
Thank you.