I might be misunderstanding, but the tables have common fields, session id and time, for instance. Is that what you are asking?
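A rough Pig Latin sketch of that kind of cross-type analysis (the average number of clicks for sessions containing a search action), using only the shared fields; the paths and schemas here are illustrative only, not the real ones:

    searches = LOAD 'search_logs' USING PigStorage('\t')
               AS (session:chararray, time:long, query:chararray, rating:int);
    views    = LOAD 'product_view_logs' USING PigStorage('\t')
               AS (session:chararray, time:long, member_id:long, product_id:long);

    -- Project each type down to the shared fields so they can be combined.
    s_common = FOREACH searches GENERATE session, time;
    v_common = FOREACH views GENERATE session, time;
    all_acts = UNION s_common, v_common;

    -- Sessions that contain at least one search action.
    s_only          = FOREACH searches GENERATE session;
    search_sessions = DISTINCT s_only;

    -- Count actions per such session, then average the counts.
    joined      = JOIN all_acts BY session, search_sessions BY session;
    per_session = GROUP joined BY all_acts::session;
    counts      = FOREACH per_session GENERATE group AS session, COUNT(joined) AS n;
    all_counts  = GROUP counts ALL;
    avg_clicks  = FOREACH all_counts GENERATE AVG(counts.n);
    DUMP avg_clicks;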
On Fri, Dec 25, 2009 at 2:50 PM, Mridul Muralidharan <[email protected]> wrote:

> Interesting.
>
> I am wondering, if the schemas for the various types are different, how
> would you project them out consistently for the analysis? As in, for the
> session analysis across types, would you just pull the subset of fields
> which exist across types?
>
> Regards,
> Mridul
>
> Gökhan Çapan wrote:
>
>> Actually, it is not a one-time operation. Also, my analysis is not only
>> type-specific.
>> As an example, when I want to create a list of popular search queries, I
>> use a type-specific analysis.
>> But when I want to get the average number of clicks for sessions which
>> contain a search action, I use a global analysis.
>> That is where our current approach fails. After splitting all types of
>> logs, when we need a global analysis result, we need joins on different
>> tables. These tables have millions of rows, and executing the queries
>> takes too long. That's why I thought some slight modifications to
>> Dmitriy's approach would solve the problem.
>>
>> On Thu, Dec 24, 2009 at 5:53 PM, Mridul Muralidharan <[email protected]> wrote:
>>
>>> If this is a one-time operation in your pipeline and you are OK with
>>> splitting it, you might want to consider using Hadoop directly and
>>> splitting based on a multiple-output collector.
>>>
>>> It can be a map-only job with a line record reader or similar: a map
>>> function which does the split as you were doing in the existing db
>>> code, writing to the appropriate output collector based on the type.
>>>
>>> All further analysis can then be done through Pig, which works on a
>>> more type-specific, schema-aware form (assuming each type has a fixed
>>> schema, while the initial jumble of types does not have a uniform
>>> schema).
>>>
>>> Not sure if it is practical, since I have not used this for map-only
>>> jobs...
>>>
>>> Regards,
>>> Mridul
>>>
>>> Gökhan Çapan wrote:
>>>
>>>> Hi, this was probably discussed before on this list, but I couldn't
>>>> find it.
>>>> We are implementing log analysis tools for some web sites that have
>>>> high traffic.
>>>> From now on, we want to use Pig to implement such analysis tools.
>>>>
>>>> We have millions of logs of a web site in a session-URL-time format.
>>>> These are not just search logs or just product views; they consist of
>>>> different types of actions.
>>>>
>>>> For example, if a URL contains a specific pattern, we call it a
>>>> search log, etc.
>>>>
>>>> Until now, I was using a factory method to instantiate the
>>>> appropriate URLHandler, and after extracting some information from
>>>> the URL, I was storing this information in the appropriate database
>>>> table. For example, if the program decides a URL is a search log, it
>>>> extracts the session, query, and time, corrects typos, determines an
>>>> implicit rating, goes to the Search table (a relational database
>>>> table), and stores these in the table. If the program decides a URL
>>>> is a product view log, it extracts the session, member_id,
>>>> product_id, time, product title, and a rating for the product, goes
>>>> to the Product_View table, and stores them. After storing is
>>>> finished, it extracts, for example, popular queries for search
>>>> assistance.
>>>>
>>>> If I want to do all of this with Pig:
>>>> - Should I partition the global log file into separate files
>>>>   (search_logs and product_view_logs in separate files)? Or
>>>> - Can some Pig commands load the data and treat each tuple according
>>>>   to its type (e.g. this is a search log, so it should have
>>>>   "session-query-time-implicit rating"), so that I can get rid of
>>>>   partitioning the data for each type of log?
>>>>
>>>> I have just downloaded Pig, and it seems it is able to do such tasks.
>>>> I would appreciate it if anyone could show me a starting point for
>>>> such an application and share some ideas.
>>>> Thank you.

--
Gökhan Çapan
Dilişim
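For the single-load question at the end of the thread: Pig's SPLIT operator can route tuples by URL pattern in one pass, so the raw log would not need to be pre-partitioned into per-type files. A rough sketch, where the paths, field names, and URL patterns are illustrative only:

    raw = LOAD 'access_logs' USING PigStorage('\t')
          AS (session:chararray, url:chararray, time:long);

    -- Route each tuple to a type-specific relation by its URL pattern,
    -- instead of splitting the input into separate files beforehand.
    SPLIT raw INTO
        searches IF url MATCHES '.*search.*',
        views    IF url MATCHES '.*product.*';

    -- Each branch then gets its type-specific treatment; a custom UDF
    -- (REGISTERed from a jar) could extract the query, correct typos, and
    -- derive the implicit rating here, much like the existing URLHandlers.
    search_logs = FOREACH searches GENERATE session, url, time;
    view_logs   = FOREACH views GENERATE session, url, time;

    STORE search_logs INTO 'search_logs_out' USING PigStorage('\t');
    STORE view_logs   INTO 'view_logs_out' USING PigStorage('\t');

Note that a tuple goes to every branch whose condition it satisfies, and is dropped if it satisfies none, so the patterns should be chosen to cover all action types of interest.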
