And yes, we are using common subfields in different tables.

On Fri, Dec 25, 2009 at 3:15 PM, Gökhan Çapan <[email protected]> wrote:
> I might misunderstand but, tables have common fields: session id, time, for
> instance. Are you asking this?
>
> On Fri, Dec 25, 2009 at 2:50 PM, Mridul Muralidharan <
> [email protected]> wrote:
>
>> Interesting.
>>
>> I am wondering, if the schemas for the various types are different, then how
>> would you be projecting them out consistently for the analysis?
>> As in, for the session analysis across types, will you just pull the subset
>> of fields which exist across types?
>>
>> Regards,
>> Mridul
>>
>> Gökhan Çapan wrote:
>>
>>> Actually it is not a one-time operation. Also, my analysis is not only
>>> type specific.
>>> As an example, when I want to create a list of popular search queries, I
>>> use a type-specific analysis.
>>> But when I want to get the average number of clicks for a session which
>>> contains a search action, I use a global analysis.
>>> That is where our current approach fails. After splitting all types of
>>> logs, when we need a global analysis result, we need some joins on
>>> different tables. These tables have millions of rows, and execution of
>>> the queries takes too long. That's why I thought some slight
>>> modifications to Dmitriy's approach would solve the problem.
>>>
>>> On Thu, Dec 24, 2009 at 5:53 PM, Mridul Muralidharan
>>> <[email protected]> wrote:
>>>
>>>> If this is a one-time operation in your pipeline and you are OK with
>>>> splitting it, you might want to consider using Hadoop directly and
>>>> splitting based on a multiple-output collector.
>>>>
>>>> It can be a map-only job with a line record reader or similar, a map
>>>> function which does the split as you were doing in the existing DB code,
>>>> and writing to the appropriate output collector based on the type.
>>>>
>>>> All further analysis can be through Pig, which works on a more
>>>> type-specific, schema-aware form (assuming each type has a fixed schema,
>>>> while the initial jumble of types does not have a uniform schema).
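The split Mridul describes above (a map function routing each raw record to a per-type output) can be sketched in plain Python. This is only an in-memory analogue of a Hadoop map-only job with a multiple-output collector; the URL patterns and the (session, url, time) record layout are illustrative assumptions, not the actual schema from the thread.

```python
import re
from collections import defaultdict

# Assumed markers for recognizing log types from the URL; the real
# patterns would come from the site's URL conventions.
SEARCH_PATTERN = re.compile(r"/search\?")
PRODUCT_PATTERN = re.compile(r"/product/\d+")

def classify(url):
    """Return the log type for a URL, mimicking the URLHandler factory."""
    if SEARCH_PATTERN.search(url):
        return "search"
    if PRODUCT_PATTERN.search(url):
        return "product_view"
    return "other"

def split_logs(records):
    """records: iterable of (session, url, time) tuples.

    One pass over the unified log, appending each record to a per-type
    bucket -- the stand-in for writing to a per-type output collector.
    """
    outputs = defaultdict(list)
    for session, url, time in records:
        outputs[classify(url)].append((session, url, time))
    return outputs

logs = [
    ("s1", "/search?q=phone", 100),
    ("s1", "/product/42", 105),
    ("s2", "/home", 110),
]
parts = split_logs(logs)
```

In the actual Hadoop job, `split_logs` would be the map function and each bucket a separate output file, which Pig could then load with a fixed, type-specific schema.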
>>>>
>>>> Not sure if it is practical, since I have not used this for map-only
>>>> jobs...
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> Gökhan Çapan wrote:
>>>>
>>>>> Hi, probably this was discussed before on this list, but I couldn't
>>>>> find it.
>>>>> We are implementing log analysis tools for some web sites that have
>>>>> high traffic.
>>>>> From now on, we want to use Pig to implement such analysis tools.
>>>>>
>>>>> We have millions of logs of a web site in a session-URL-time format.
>>>>> These are not just search logs or just product views; they consist of
>>>>> different types of actions.
>>>>>
>>>>> For example, if a URL contains a specific pattern, we call it a search
>>>>> log, etc.
>>>>>
>>>>> Until now, I was using a factory method to instantiate the appropriate
>>>>> URLHandler, and after extracting some information from the URL, I was
>>>>> storing that information in the appropriate database table. For
>>>>> example, if the program decides a URL is a search log, it extracts
>>>>> session, query, and time, corrects typos, determines an implicit
>>>>> rating, and stores these in the Search table (a relational database
>>>>> table). If the program decides a URL is a product view log, it extracts
>>>>> session, member_id, product_id, time, product title, and a rating for
>>>>> the product, and stores them in the Product_View table. After storing
>>>>> is finished, it extracts, for example, popular queries for assisting
>>>>> search.
>>>>>
>>>>> If I want to do all of this with Pig:
>>>>> - Should I partition the global log file into separate files
>>>>> (search_logs and product_view_logs in separate files)? or
>>>>> - Can some Pig commands load the data and treat each tuple according to
>>>>> its type (e.g., this is a search log and it should have
>>>>> "session-query-time-implicit rating"), so that I can get rid of
>>>>> partitioning the data for each type of log?
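The factory-style dispatch described above (pick a URLHandler from the URL, then extract type-specific fields) can be sketched as follows. The URL shapes (`/search?q=...`, `/product/<id>`) and the handler names are assumptions for illustration only.

```python
import re
from urllib.parse import urlparse, parse_qs

def handle_search(session, url, time):
    # Pull the query string parameter out of the URL; a real handler
    # would also correct typos and compute an implicit rating.
    query = parse_qs(urlparse(url).query).get("q", [""])[0]
    return {"type": "search", "session": session,
            "query": query, "time": time}

def handle_product_view(session, url, time):
    # Extract the product id embedded in the URL path.
    match = re.search(r"/product/(\d+)", url)
    return {"type": "product_view", "session": session,
            "product_id": match.group(1), "time": time}

def extract(session, url, time):
    """Factory-style dispatch: choose the handler from the URL pattern."""
    if "/search?" in url:
        return handle_search(session, url, time)
    if re.search(r"/product/\d+", url):
        return handle_product_view(session, url, time)
    return {"type": "other", "session": session, "time": time}

row = extract("s1", "/search?q=red+shoes", 100)
```

In Pig terms, the same dispatch could be expressed once at load time (e.g., with a `SPLIT` by URL pattern), so the raw log need not be partitioned into per-type files up front.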
>>>>>
>>>>> I have just downloaded Pig, and it seems it is able to do such tasks.
>>>>> I would appreciate it if anyone could show me a starting point for
>>>>> such an application and share some ideas.
>>>>> Thank you.
>>>>>
>
> --
> Gökhan Çapan
> Dilişim

--
Gökhan Çapan
Dilişim
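The global analysis discussed earlier in the thread (average number of clicks for sessions that contain a search action) can be sketched over the common subfields alone. Because every record carries the shared fields (session id, type), this runs in one pass over the unified log with no per-type joins; the (session, log_type) record layout is an assumption for illustration.

```python
from collections import defaultdict

def avg_clicks_for_search_sessions(records):
    """records: iterable of (session, log_type) pairs; each record counts
    as one click. Returns the average click count over sessions that
    include at least one 'search' record."""
    clicks = defaultdict(int)
    has_search = set()
    for session, log_type in records:
        clicks[session] += 1          # every action counts as a click
        if log_type == "search":
            has_search.add(session)   # mark sessions containing a search
    if not has_search:
        return 0.0
    return sum(clicks[s] for s in has_search) / len(has_search)

records = [
    ("s1", "search"), ("s1", "product_view"), ("s1", "other"),
    ("s2", "product_view"),
    ("s3", "search"),
]
avg = avg_clicks_for_search_sessions(records)  # (3 + 1) / 2 sessions
```

The same shape maps naturally onto a Pig `GROUP ... BY session` followed by a filter and an average, which is the join-free alternative to reassembling the per-type tables.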
