Hi, this was probably discussed on this list before, but I couldn't find it.
We are implementing log analysis tools for some high-traffic web sites.
From now on, we want to use Pig to implement such analysis tools.

We have millions of log records for a web site in a session-URL-time format.
These are not just search logs or product views; they cover different types
of actions.

For example, if a URL contains a specific pattern, we call it a search log,
and so on.
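
To make the format concrete, here is a minimal sketch of how I imagine
loading the combined log into Pig; the tab delimiter, the field names, and
the file path are just assumptions on my part:

-- Load the combined session-URL-time log (delimiter, schema, and path are assumptions)
raw_logs = LOAD 'weblogs/access_log' USING PigStorage('\t')
           AS (session:chararray, url:chararray, time:chararray);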

Until now, I was using a factory method to instantiate the appropriate
URLHandler and, after extracting some information from the URL, I stored
that information in the appropriate database table. For example, if the
program decides a URL is a search log, it extracts the session, query, and
time, corrects typos, determines an implicit rating, and stores these in the
Search table (a relational database table). If the program decides a URL is
a product view log, it extracts the session, member_id, product_id, time,
product title, and product rating, and stores them in the Product_View
table. After everything is stored, it runs further analyses, for example
extracting popular queries for assisting search.

If I want to do all of this with Pig:
- Should I partition the global log file into separate files (search_logs
and product_view_logs in separate files)? Or
- Can some Pig commands load the data and handle each tuple according to its
type (e.g., this is a search log, so it should have
"session-query-time-implicit rating"), so that I can avoid partitioning the
data for each type of log? (See the sketch below for what I have in mind.)
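
For the second option, here is a rough sketch of what I have in mind after a
first reading of the Pig documentation. The URL patterns, regular
expressions, field names, and output paths are only placeholders, and
REGEX_EXTRACT just stands in for whatever extraction the URLHandlers do
today (typo correction and implicit rating would probably need UDFs):

raw_logs = LOAD 'weblogs/access_log' USING PigStorage('\t')
           AS (session:chararray, url:chararray, time:chararray);

-- Route each tuple by its URL pattern instead of pre-partitioning the input file
SPLIT raw_logs INTO
    search_raw IF url MATCHES '.*/search.*',
    product_view_raw IF url MATCHES '.*/product.*';

-- Per-type field extraction from the URL (regexes are placeholders)
search_logs = FOREACH search_raw GENERATE
    session,
    REGEX_EXTRACT(url, 'q=([^&]+)', 1) AS query,
    time;

product_view_logs = FOREACH product_view_raw GENERATE
    session,
    REGEX_EXTRACT(url, 'product_id=([0-9]+)', 1) AS product_id,
    time;

-- Downstream analysis, e.g. popular queries for assisting search
grouped = GROUP search_logs BY query;
popular_queries = FOREACH grouped GENERATE group AS query, COUNT(search_logs) AS hits;

STORE search_logs INTO 'output/search_logs' USING PigStorage('\t');
STORE product_view_logs INTO 'output/product_view_logs' USING PigStorage('\t');
STORE popular_queries INTO 'output/popular_queries' USING PigStorage('\t');

If something like this is roughly the idiomatic approach, I could drop the
partitioning step entirely and keep everything in one script.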

I have just downloaded Pig, and it seems able to handle such tasks. I would
appreciate it if anyone could show me a starting point for such an
application and share some ideas.
Thank you.
-- 
Gökhan Çapan
Dilişim
