Hi Gökhan, I assume your log has one record per line, and it seems that your logs have different types, with each type of log having different fields. If you'd like to use Pig for your case, I think you have several options:
Option 1. Create a different LoadFunc for each type of log, and filter out records inside the LoadFunc if they are not the type you want.

Option 2. Split each type of log into a different file, then load each file with that log type's respective LoadFunc.

Option 3. Do not split your log files; instead, normalize your logs' fields. Here normalization means merging the fields of all the log types into one large field set. E.g., if you have two types of logs, one with fields (A_1, A_2) and the other with fields (B_1, B_2), you can merge them into the large field set (Log_Type, A_1, A_2, B_1, B_2), and then separate the logs in your Pig script using the SPLIT statement.

Which method to use depends on your requirements and situation. I hope I did not misunderstand your meaning.

Jeff Zhang

On Thu, Dec 24, 2009 at 1:16 AM, Gökhan Çapan <[email protected]> wrote:
> Hi, this was probably discussed before on this list, but I couldn't find it.
> We are implementing log analysis tools for some web sites that have high
> traffic. From now on, we want to use Pig to implement such analysis tools.
>
> We have millions of logs of a web site in a session-URL-time format.
> These are not just search logs, or just product views; they consist of
> different types of actions.
>
> For example, if a URL contains a specific pattern, we call it a search log,
> etc.
>
> Until now, I was using a factory method to instantiate the appropriate
> URLHandler, and after extracting some information from the URL, I was
> storing this information in the appropriate database table. For example,
> if the program decides a URL is a search log, it extracts the session,
> query, and time, corrects typos, determines an implicit rating, and stores
> these in the Search table (a relational database table). If the program
> decides a URL is a product view log, it extracts the session, member_id,
> product_id, time, product title, and rating for the product, and stores
> them in the Product_View table.
>
> After it finishes storing, it extracts, for example, popular queries for
> assisting search.
>
> If I want to do all of this with Pig:
> - Should I partition the global log file into separate files (search_logs
> and product_view_logs in separate files)? Or
> - Can some Pig commands load the data and treat each tuple according to
> its type (e.g., this is a search log and it should have
> "session-query-time-implicit rating"), so I can get rid of partitioning
> the data for each type of log?
>
> I have just downloaded Pig and it seems able to do such tasks. I would
> appreciate it if anyone could show me a starting point for such an
> application and share some ideas.
> Thank you.
> --
> Gökhan Çapan
> Dilişim
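To make Option 3 concrete, here is a minimal Pig Latin sketch of what I mean. The input path, field names, and log-type values are all made up for illustration; you would substitute your own normalized schema:

```pig
-- Load the normalized logs; path, delimiter, and schema are hypothetical.
raw = LOAD '/logs/normalized' USING PigStorage('\t')
      AS (log_type:chararray, session:chararray, time:chararray,
          query:chararray, member_id:chararray, product_id:chararray);

-- Route each record to the relation matching its log type.
SPLIT raw INTO
    search_logs IF log_type == 'search',
    product_view_logs IF log_type == 'product_view';

-- From here on, each relation can be processed using only its own fields,
-- e.g. projecting the search-specific columns:
queries = FOREACH search_logs GENERATE session, query, time;
```

The fields that do not apply to a given log type would simply be left empty (null) in the normalized records, so each branch of the SPLIT only ever reads the columns that are meaningful for its type.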
