No, you didn't misunderstand, and thank you very much for this advice. But I couldn't understand what you meant by the first option.
On Thu, Dec 24, 2009 at 11:36 AM, Jeff Zhang <[email protected]> wrote:

> Hi Gökhan, I assume your log is one record per line, and it seems that
> your logs have different types, and each type of log has different
> fields. If you'd like to use Pig for your case, I think you have
> several options:
>
> Option 1. You can create a different LoadFunc for each type of log,
> and filter out the other types in the LoadFunc if they are not the
> type you want.
>
> Option 2. Split each type of log into a different file, then load each
> file with its log type's respective LoadFunc.
>
> Option 3. Do not split your log files; instead, normalize your logs'
> fields. Here, normalization means merging the fields of all the log
> types into one large field set. E.g., if you have two types of logs,
> one with fields (A_1, A_2) and the other with fields (B_1, B_2), you
> can merge them into a large field set: (Log_Type, A_1, A_2, B_1, B_2).
> Then split the logs in your Pig script using the SPLIT statement.
>
> Which method to use depends on your requirements and situation. I hope
> I did not misunderstand your meaning.
>
> Jeff Zhang
>
> On Thu, Dec 24, 2009 at 1:16 AM, Gökhan Çapan <[email protected]> wrote:
>
> > Hi, this was probably discussed before on this list, but I couldn't
> > find it. We are implementing log-analysis tools for some web sites
> > that have high traffic. From now on, we want to use Pig to implement
> > such analysis tools.
> >
> > We have millions of logs of a web site in a session-URL-time format.
> > These are not just search logs or just product views; they consist
> > of different types of actions.
> >
> > For example, if a URL contains a specific pattern, we call it a
> > search log, etc.
> >
> > Until now, I was using a factory method to instantiate the
> > appropriate URLHandler, and after extracting some information from
> > the URL, I was storing this information in the appropriate database
> > table. For example, if the program decides a URL is a search log, it
> > extracts the session, query, and time, corrects typos, determines an
> > implicit rating, goes to the Search table (a relational database
> > table), and stores these there. If the program decides a URL is a
> > product-view log, it extracts the session, member_id, product_id,
> > time, product title, and rating for the product, goes to the
> > Product_View table, and stores them. After storing, it extracts, for
> > example, popular queries for search assistance.
> >
> > If I want to do all of this with Pig:
> > - Should I partition the global log file into separate files (so
> >   that search_logs and product_view_logs are in separate files)? Or
> > - Can some Pig commands load the data and treat each tuple according
> >   to its type (e.g., "this is a search log, so it should have
> >   session-query-time-implicit rating"), so that I can avoid
> >   partitioning the data by log type?
> >
> > I have just downloaded Pig, and it seems able to do such tasks. I
> > would appreciate it if anyone could show me a starting point for
> > such an application and share some ideas.
> > Thank you.
> > --
> > Gökhan Çapan
> > Dilişim

--
Gökhan Çapan
Dilişim
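For anyone following the thread, Option 3 above might look roughly like this in Pig Latin. This is only a sketch: the file path, delimiter, field names, and log-type values are hypothetical placeholders, reusing the (Log_Type, A_1, A_2, B_1, B_2) field set from Jeff's example.

```pig
-- Load the normalized logs; every record carries the full, merged
-- field set, with unused fields left empty for its type.
-- 'normalized_logs' and all field names here are illustrative.
logs = LOAD 'normalized_logs' USING PigStorage('\t')
       AS (log_type:chararray, a_1:chararray, a_2:chararray,
           b_1:chararray, b_2:chararray);

-- SPLIT routes each tuple into the relation matching its type,
-- so no pre-partitioning of the input files is needed.
SPLIT logs INTO
    search_logs IF log_type == 'search',
    product_view_logs IF log_type == 'product_view';

-- Each relation can then keep only the fields relevant to its type.
searches = FOREACH search_logs GENERATE a_1, a_2;
views    = FOREACH product_view_logs GENERATE b_1, b_2;
```

With this layout, a single LOAD handles all log types, and the type-specific processing (popular queries, product ratings, and so on) continues from `searches` and `views` separately.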
