I might be misunderstanding, but the tables do have common fields:
session id and time, for instance. Is that what you are asking?
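
For example, to get the average number of clicks for sessions that
contain a search action (the global analysis I mentioned), something
roughly like this should work; the file name, field names, and action
labels here are made up:

  logs    = LOAD 'all_logs'
            AS (session_id:chararray, action:chararray, time:long);
  by_sess = GROUP logs BY session_id;
  stats   = FOREACH by_sess {
              searches = FILTER logs BY action == 'search';
              clicks   = FILTER logs BY action == 'click';
              GENERATE group AS session_id,
                       COUNT(searches) AS n_searches,
                       COUNT(clicks)   AS n_clicks;
            };
  with_search = FILTER stats BY n_searches > 0;
  all_grp     = GROUP with_search ALL;
  avg_clicks  = FOREACH all_grp GENERATE AVG(with_search.n_clicks);

The type-specific fields would only come into play in the per-type
analyses.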

On Fri, Dec 25, 2009 at 2:50 PM, Mridul Muralidharan
<[email protected]> wrote:

>
> Interesting.
>
> I am wondering: if the schemas for the various types are different, how
> would you project them out consistently for the analysis?
> As in, for the session analysis across types, would you just pull the
> subset of fields which exist across types?
>
>
>
> Regards,
> Mridul
>
> Gökhan Çapan wrote:
>
>> Actually, it is not a one-time operation, and my analysis is not only
>> type-specific.
>> As an example, when I want to create a list of popular search queries,
>> I use a type-specific analysis.
>> But when I want to get the average number of clicks for sessions that
>> contain a search action, I use a global analysis.
>> That is where our current approach fails. After splitting all types of
>> logs, whenever we need a global analysis result, we need joins across
>> different tables. These tables have millions of rows, and the queries
>> take too long to execute. That's why I thought some slight
>> modifications to Dmitriy's approach would solve the problem.
>>
>> On Thu, Dec 24, 2009 at 5:53 PM, Mridul Muralidharan
>> <[email protected]> wrote:
>>
>>> If this is a one-time operation in your pipeline and you are ok with
>>> splitting it, you might want to consider using Hadoop directly and
>>> splitting based on a multiple-output collector.
>>>
>>> It can be a map-only job with a line record reader or similar: a map
>>> function which does the split as you were doing in the existing DB
>>> code, writing to the appropriate output collector based on the type.
>>>
>>>
>>> All further analysis can then be done through Pig, which works on a
>>> more type-specific, schema-aware form (assuming each type has a fixed
>>> schema, while the initial jumble of types does not).
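>>>
>>> For example, popular queries over the search split could look roughly
>>> like this (just a sketch; the path and field names are made up):
>>>
>>>   searches = LOAD 'search_logs'
>>>              AS (session:chararray, query:chararray,
>>>                  time:long, rating:int);
>>>   by_query = GROUP searches BY query;
>>>   popular  = FOREACH by_query
>>>              GENERATE group AS query, COUNT(searches) AS cnt;
>>>   ranked   = ORDER popular BY cnt DESC;
>>>   STORE ranked INTO 'popular_queries';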
>>>
>>>
>>>
>>> Not sure if it is practical, since I have not used the multiple-output
>>> collector for map-only jobs...
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>>
>>> Gökhan Çapan wrote:
>>>
>>>> Hi, this was probably discussed on this list before, but I couldn't
>>>> find it.
>>>> We are implementing log analysis tools for some high-traffic web
>>>> sites. From now on, we want to use Pig to implement these tools.
>>>>
>>>> We have millions of log entries for a web site in a session-URL-time
>>>> format. These are not just search logs or product views; they consist
>>>> of many different types of actions.
>>>>
>>>> For example, if a URL contains a specific pattern, we call it a
>>>> search log, etc.
>>>>
>>>> Until now, I was using a factory method to instantiate the
>>>> appropriate URLHandler, and after extracting some information from
>>>> the URL, I was storing that information in the appropriate database
>>>> table. For example, if the program decides a URL is a search log, it
>>>> extracts the session, query, and time, corrects typos, determines the
>>>> implicit rating, and stores these in the Search table (a relational
>>>> database table). If the program decides a URL is a product view log,
>>>> it extracts the session, member_id, product_id, time, product title,
>>>> and product rating, and stores them in the Product_View table. After
>>>> storing everything, it then extracts, for example, popular queries
>>>> for search assistance.
>>>>
>>>> If I want to do all of this with Pig:
>>>> - Should I partition the global log file into separate files
>>>> (search_logs and product_view_logs in separate files)? Or
>>>> - Can some Pig commands load the data and treat each tuple according
>>>> to its type (e.g. "this is a search log, so it should have
>>>> session-query-time-implicit rating"), so that I can avoid
>>>> partitioning the data for each type of log? (I sketch my guess
>>>> below.)
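>>>>
>>>> For the second option, I am imagining something like the following
>>>> (just guessing at the syntax; the URL patterns are made up):
>>>>
>>>>   logs = LOAD 'all_logs'
>>>>          AS (session:chararray, url:chararray, time:long);
>>>>   SPLIT logs INTO searches IF url MATCHES '.*search.*',
>>>>                   views    IF url MATCHES '.*product.*';
>>>>
>>>> Is something like this workable?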
>>>>
>>>> I have just downloaded Pig and it seems able to do such tasks. I
>>>> would appreciate it if anyone could show me a starting point for such
>>>> an application and share some ideas.
>>>> Thank you.
>>>>
>>>>
>>>
>>
>>
>


-- 
Gökhan Çapan
Dilişim
