And yes, we are using common subfields in different tables.

On Fri, Dec 25, 2009 at 3:15 PM, Gökhan Çapan <[email protected]> wrote:
> I might misunderstand but, tables have common fields: session id, time, for
> instance. Are you asking this?
>
> On Fri, Dec 25, 2009 at 2:50 PM, Mridul Muralidharan <
> [email protected]> wrote:
>
>> Interesting.
>>
>> I am wondering, if the schemas for the various types are different, then how
>> would you be projecting them out consistently for the analysis?
>> As in, for the session analysis across types, will you just pull the subset
>> of fields which exist across types?
>>
>> Regards,
>> Mridul
>>
>> Gökhan Çapan wrote:
>>
>>> Actually it is not a one-time operation. Also, my analysis is not only
>>> type specific.
>>> As an example, when I want to create a list of popular search queries, I
>>> use a type-specific analysis.
>>> But when I want to get the average number of clicks for a session which
>>> contains a search action, I use a global analysis.
>>> That is where our current approach fails. After splitting all types of
>>> logs, when we need a global analysis result, we need some joins on
>>> different tables. These tables have millions of rows, and execution of
>>> the queries takes too long. That's why I thought some slight
>>> modifications to Dmitriy's approach would solve the problem.
>>>
>>> On Thu, Dec 24, 2009 at 5:53 PM, Mridul Muralidharan
>>> <[email protected]> wrote:
>>>
>>>> If this is a one-time operation in your pipeline and you are OK with
>>>> splitting it, you might want to consider using Hadoop directly and
>>>> splitting based on a multiple-output collector.
>>>>
>>>> It can be a map-only job with a line record reader or similar, a map
>>>> function which does the split as you were doing in the existing DB code,
>>>> and writing to the appropriate output collector based on the type.
>>>>
>>>> All further analysis can be through Pig, which works on a more
>>>> type-specific, schema-aware form (assuming each type has a fixed schema,
>>>> while the initial jumble of types does not have a uniform schema).
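The split Mridul describes above (a map function routing each raw record to a per-type output) can be sketched in plain Python. This is only an in-memory analogue of a Hadoop map-only job with a multiple-output collector; the URL patterns and the (session, url, time) record layout are illustrative assumptions, not the actual schema from the thread.

```python
import re
from collections import defaultdict

# Assumed markers for recognizing log types from the URL; the real
# patterns would come from the site's URL conventions.
SEARCH_PATTERN = re.compile(r"/search\?")
PRODUCT_PATTERN = re.compile(r"/product/\d+")

def classify(url):
    """Return the log type for a URL, mimicking the URLHandler factory."""
    if SEARCH_PATTERN.search(url):
        return "search"
    if PRODUCT_PATTERN.search(url):
        return "product_view"
    return "other"

def split_logs(records):
    """records: iterable of (session, url, time) tuples.

    One pass over the unified log, appending each record to a per-type
    bucket -- the stand-in for writing to a per-type output collector.
    """
    outputs = defaultdict(list)
    for session, url, time in records:
        outputs[classify(url)].append((session, url, time))
    return outputs

logs = [
    ("s1", "/search?q=phone", 100),
    ("s1", "/product/42", 105),
    ("s2", "/home", 110),
]
parts = split_logs(logs)
```

In the actual Hadoop job, `split_logs` would be the map function and each bucket a separate output file, which Pig could then load with a fixed, type-specific schema.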
>>>>
>>>> Not sure if it is practical, since I have not used this for map-only
>>>> jobs...
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> Gökhan Çapan wrote:
>>>>
>>>>> Hi, probably this was discussed before on this list, but I couldn't
>>>>> find it.
>>>>> We are implementing log analysis tools for some web sites that have
>>>>> high traffic.
>>>>> From now on, we want to use Pig to implement such analysis tools.
>>>>>
>>>>> We have millions of logs of a web site in a session-URL-time format.
>>>>> These are not just search logs or just product views; they consist of
>>>>> different types of actions.
>>>>>
>>>>> For example, if a URL contains a specific pattern, we call it a search
>>>>> log, etc.
>>>>>
>>>>> Until now, I was using a factory method to instantiate the appropriate
>>>>> URLHandler, and after extracting some information from the URL, I was
>>>>> storing that information in the appropriate database table. For
>>>>> example, if the program decides a URL is a search log, it extracts
>>>>> session, query, and time, corrects typos, determines an implicit
>>>>> rating, and stores these in the Search table (a relational database
>>>>> table). If the program decides a URL is a product view log, it extracts
>>>>> session, member_id, product_id, time, product title, and a rating for
>>>>> the product, and stores them in the Product_View table. After storing
>>>>> is finished, it extracts, for example, popular queries for assisting
>>>>> search.
>>>>>
>>>>> If I want to do all of this with Pig:
>>>>> - Should I partition the global log file into separate files
>>>>> (search_logs and product_view_logs in separate files)? or
>>>>> - Can some Pig commands load the data and treat each tuple according to
>>>>> its type (e.g., this is a search log and it should have
>>>>> "session-query-time-implicit rating"), so that I can get rid of
>>>>> partitioning the data for each type of log?
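The factory-style dispatch described above (pick a URLHandler from the URL, then extract type-specific fields) can be sketched as follows. The URL shapes (`/search?q=...`, `/product/<id>`) and the handler names are assumptions for illustration only.

```python
import re
from urllib.parse import urlparse, parse_qs

def handle_search(session, url, time):
    # Pull the query string parameter out of the URL; a real handler
    # would also correct typos and compute an implicit rating.
    query = parse_qs(urlparse(url).query).get("q", [""])[0]
    return {"type": "search", "session": session,
            "query": query, "time": time}

def handle_product_view(session, url, time):
    # Extract the product id embedded in the URL path.
    match = re.search(r"/product/(\d+)", url)
    return {"type": "product_view", "session": session,
            "product_id": match.group(1), "time": time}

def extract(session, url, time):
    """Factory-style dispatch: choose the handler from the URL pattern."""
    if "/search?" in url:
        return handle_search(session, url, time)
    if re.search(r"/product/\d+", url):
        return handle_product_view(session, url, time)
    return {"type": "other", "session": session, "time": time}

row = extract("s1", "/search?q=red+shoes", 100)
```

In Pig terms, the same dispatch could be expressed once at load time (e.g., with a `SPLIT` by URL pattern), so the raw log need not be partitioned into per-type files up front.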
>>>>>
>>>>> I have just downloaded Pig, and it seems it is able to do such tasks.
>>>>> I would appreciate it if anyone could show me a starting point for
>>>>> such an application and share some ideas.
>>>>> Thank you.
>>>>>
>
> --
> Gökhan Çapan
> Dilişim

--
Gökhan Çapan
Dilişim
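The global analysis discussed earlier in the thread (average number of clicks for sessions that contain a search action) can be sketched over the common subfields alone. Because every record carries the shared fields (session id, type), this runs in one pass over the unified log with no per-type joins; the (session, log_type) record layout is an assumption for illustration.

```python
from collections import defaultdict

def avg_clicks_for_search_sessions(records):
    """records: iterable of (session, log_type) pairs; each record counts
    as one click. Returns the average click count over sessions that
    include at least one 'search' record."""
    clicks = defaultdict(int)
    has_search = set()
    for session, log_type in records:
        clicks[session] += 1          # every action counts as a click
        if log_type == "search":
            has_search.add(session)   # mark sessions containing a search
    if not has_search:
        return 0.0
    return sum(clicks[s] for s in has_search) / len(has_search)

records = [
    ("s1", "search"), ("s1", "product_view"), ("s1", "other"),
    ("s2", "product_view"),
    ("s3", "search"),
]
avg = avg_clicks_for_search_sessions(records)  # (3 + 1) / 2 sessions
```

The same shape maps naturally onto a Pig `GROUP ... BY session` followed by a filter and an average, which is the join-free alternative to reassembling the per-type tables.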
