Actually it is not a one-time operation, and my analysis is not only type-specific. For example, when I want to build a list of popular search queries, I use a type-specific analysis; but when I want to compute the average number of clicks for sessions that contain a search action, I use a global analysis. That is where our current approach fails: after splitting the logs into per-type tables, every global analysis needs joins across those tables. The tables have millions of rows, and those queries take too long to execute. That's why I thought some slight modifications to Dmitriy's approach would solve the problem.
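To make the global case concrete, here is a rough Pig sketch of the session-level query I mean, assuming a unified (session, url, time) log; the field names and the URL pattern are only placeholders for our real schema:

logs = LOAD 'weblogs' USING PigStorage('\t')
       AS (session:chararray, url:chararray, time:long);

-- tag each record with its action type instead of storing it in a
-- separate table; the pattern below is just a placeholder
typed = FOREACH logs GENERATE session, url, time,
        ((url matches '.*search.*') ? 'search' : 'other') AS action;

-- per-session click counts, plus how many of those clicks were searches
grouped = GROUP typed BY session;
stats = FOREACH grouped {
        searches = FILTER typed BY action == 'search';
        GENERATE group AS session,
                 COUNT(typed) AS clicks,
                 COUNT(searches) AS search_clicks;
};

-- average clicks over sessions that contain at least one search action
with_search = FILTER stats BY search_clicks > 0;
all_grp = GROUP with_search ALL;
avg_clicks = FOREACH all_grp GENERATE AVG(with_search.clicks);

Since everything stays in one relation, there is no join between per-type tables at all.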
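And as far as I can tell, SPLIT can recover the per-type relations from that same typed relation, so I would not need to partition the file beforehand. Again a rough sketch; EXTRACT_QUERY is a hypothetical UDF we would have to write ourselves to pull the query string out of a search URL:

SPLIT typed INTO search_logs IF action == 'search',
                 other_logs IF action != 'search';

-- popular queries from the search partition
REGISTER myudfs.jar;  -- hypothetical jar holding EXTRACT_QUERY
queries = FOREACH search_logs GENERATE myudfs.EXTRACT_QUERY(url) AS query;
qgrp = GROUP queries BY query;
popular = FOREACH qgrp GENERATE group AS query, COUNT(queries) AS hits;
top_queries = ORDER popular BY hits DESC;

Is that the idiomatic way to do it, or is there a better pattern?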
On Thu, Dec 24, 2009 at 5:53 PM, Mridul Muralidharan <[email protected]> wrote:

>
> If this is a one-time operation in your pipeline and you are ok with
> splitting it, you might want to consider using hadoop directly and
> splitting based on a multiple-output collector.
>
> It can be a map-only job with a line record reader or similar, a map
> function which does the split as you were doing in the existing db code,
> and writing to the appropriate output collector based on the type.
>
> All further analysis can be through pig - which works on a more
> type-specific, schema-aware form (assuming each type has a fixed schema,
> while the initial jumble of types does not have a uniform schema).
>
> Not sure if it is practical since i have not used this for map-only jobs
> ...
>
> Regards,
> Mridul
>
> Gökhan Çapan wrote:
>
>> Hi, probably this was discussed before on this list, but I couldn't find
>> it. We are implementing log analysis tools for some web sites that have
>> high traffic. From now on, we want to use Pig to implement such analysis
>> tools.
>>
>> We have millions of logs of a web site in a session-URL-time format.
>> These are not just search logs or product views; they consist of
>> different types of actions.
>>
>> For example, if a URL contains a specific pattern, we call it a search
>> log, etc.
>>
>> Until now, I was using a factory method to instantiate the appropriate
>> URLHandler, and after extracting some information from the URL, I stored
>> that information in the appropriate database table. For example, if the
>> program decides a URL is a search log, it extracts session, query, and
>> time, corrects typos, determines the implicit rating, goes to the Search
>> table (a relational database table), and stores these in the table. If
>> the program decides a URL is a product view log, it extracts session,
>> member_id, product_id, time, product title, and rating for the product,
>> goes to the Product_View table, and stores it. After the storing is
>> finished, it extracts, for example, popular queries for assisting search.
>>
>> If I want to do all of this with Pig:
>> - Should I partition the global log file into separate files (search_logs
>> and product_view_logs in separate files)? or
>> - Can some Pig commands load the data and treat each tuple according to
>> its type (e.g. "this is a search log, so it should have
>> session-query-time-implicit rating"), so I can get rid of partitioning
>> the data for each type of log?
>>
>> I have just downloaded Pig and it seems able to do such tasks. I would
>> appreciate it if anyone could show me a starting point for such an
>> application and share some ideas.
>> Thank you.
>>

--
Gökhan Çapan
Dilişim
