If this is a one-time operation in your pipeline and you are OK with splitting the data, you might want to consider using Hadoop directly and splitting based on a multiple-output collector.
It can be a map-only job with a line record reader or similar: a map function that does the split the way you were doing it in your existing DB code, writing to the appropriate output collector based on the type.
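A minimal sketch of that idea, assuming the newer org.apache.hadoop.mapreduce API and its MultipleOutputs helper; the URL patterns, the tab-separated session/URL/time layout, and the output names below are assumptions to be replaced with the real rules:

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LogSplitter {

    public static class SplitMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        // Hypothetical URL rules; replace with the real patterns.
        private static final Pattern SEARCH = Pattern.compile(".*[?&]q=.*");
        private static final Pattern PRODUCT = Pattern.compile(".*/product/.*");

        private MultipleOutputs<NullWritable, Text> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The default TextInputFormat hands the mapper one log line at a time.
            // Assumes tab-separated "session \t url \t time" records.
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            String url = fields[1];
            if (SEARCH.matcher(url).matches()) {
                out.write("search", NullWritable.get(), value, "search/part");
            } else if (PRODUCT.matcher(url).matches()) {
                out.write("productview", NullWritable.get(), value, "productview/part");
            }
            // unmatched lines are simply dropped
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            out.close(); // flush all per-type files
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log splitter");
        job.setJarByClass(LogSplitter.class);
        job.setMapperClass(SplitMapper.class);
        job.setNumReduceTasks(0); // map-only: mappers write the named outputs directly
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // Named output names must be alphanumeric.
        MultipleOutputs.addNamedOutput(job, "search",
                TextOutputFormat.class, NullWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "productview",
                TextOutputFormat.class, NullWritable.class, Text.class);
        // Avoid creating empty part files for the unused default output.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each type then lands under its own subdirectory of the job output (search/, productview/), so each type can be loaded into Pig with its own fixed schema.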
All further analysis can then be done through Pig, which would work on a more type-specific, schema-aware form (assuming each type has a fixed schema, while the initial jumble of types does not have a uniform schema).
Not sure if it is practical, since I have not used this for map-only jobs ...

Regards,
Mridul

Gökhan Çapan wrote:
Hi,

This was probably discussed on this list before, but I couldn't find it. We are implementing log analysis tools for some web sites that have high traffic, and from now on we want to use Pig to implement such tools. We have millions of logs for a web site in a session-URL-time format. These are not just search logs or just product views; they consist of different types of actions. For example, if a URL contains a specific pattern, we call it a search log, and so on.

Until now, I was using a factory method to instantiate the appropriate URLHandler, and after extracting some information from the URL, I stored that information in the appropriate database table. For example, if the program decides a URL is a search log, it extracts session, query, and time, corrects typos, determines an implicit rating, and stores all of these in the Search table (a relational database table). If the program decides a URL is a product view log, it extracts session, member_id, product_id, time, product title, and a rating for the product, and stores them in the Product_View table. After the storing is finished, it extracts, for example, popular queries for assisting search.

If I want to do all of this with Pig:

- Should I partition the global log file into separate files (so that search_logs and product_view_logs are in separate files)? Or
- Can some Pig commands load the data and treat each tuple according to its type (e.g., "this is a search log, so it should have session-query-time-implicit rating"), so that I can get rid of partitioning the data for each type of log?

I have just downloaded Pig, and it seems it is able to do such tasks. I would appreciate it if anyone could show me a starting point for such an application and share some ideas.

Thank you.
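For readers following the thread: the factory dispatch described in the question might look roughly like the Java sketch below; all names and the search pattern are hypothetical. Mridul's suggestion above amounts to reusing this same classification inside a map function, writing to a per-type output instead of a database table.

import java.util.regex.Pattern;

interface URLHandler {
    // Extract the type-specific fields from the log entry and store them.
    void handle(String session, String url, long time);
}

class SearchHandler implements URLHandler {
    public void handle(String session, String url, long time) {
        // extract the query, correct typos, determine the implicit rating,
        // then insert into the Search table
    }
}

class ProductViewHandler implements URLHandler {
    public void handle(String session, String url, long time) {
        // extract member_id, product_id, product title, and rating,
        // then insert into the Product_View table
    }
}

class URLHandlerFactory {
    // Hypothetical pattern standing in for "URL contains a specific pattern".
    private static final Pattern SEARCH = Pattern.compile(".*[?&]q=.*");

    static URLHandler forUrl(String url) {
        return SEARCH.matcher(url).matches()
                ? new SearchHandler()
                : new ProductViewHandler();
    }
}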
