No, you didn't misunderstand, and thank you very much for this advice.
However, I couldn't understand what you meant by the 1st option.

On Thu, Dec 24, 2009 at 11:36 AM, Jeff Zhang <[email protected]> wrote:

> Hi Gökhan, I assume your log has one record per line. It also seems that
> your logs come in different types, and that each type of log has different
> fields. If you'd like to use Pig for your case, I think you have
> several options:
>
> Option 1. You can create a different LoadFunc for each type of log, and
> have each LoadFunc filter out records that are not the type you want.
>
> Option 2. Split each type of log into its own file, then load the logs
> using each log type's respective LoadFunc.
>
> Option 3. Do not split your log files; instead, normalize your logs'
> fields. Here, normalization means merging the fields of all the log types
> into one large field set. E.g., if you have two types of logs, one with
> fields (A_1, A_2) and the other with fields (B_1, B_2), you can merge
> them into one large field set: (Log_Type, A_1, A_2, B_1, B_2). Then split
> the logs in the Pig script using the SPLIT statement.
>
>
> Which method to use depends on your requirements and situation. I hope I
> did not misunderstand your meaning.
>
>
> Jeff Zhang
>
>
> On Thu, Dec 24, 2009 at 1:16 AM, Gökhan Çapan <[email protected]> wrote:
>
> > Hi, this was probably discussed before on this list, but I couldn't
> > find it. We are implementing log analysis tools for some web sites that
> > have high traffic.
> > From now on, we want to use Pig to implement such analysis tools.
> >
> > We have millions of log records for a web site in a session-URL-time
> > format. These are not just search logs or just product views; they
> > consist of different types of actions.
> >
> > For example, if a URL contains a specific pattern, we call it a search
> > log, etc.
> >
> > Until now, I was using a factory method to instantiate the appropriate
> > URLHandler and, after extracting some information from the URL, I was
> > storing this information in the appropriate database table. For example,
> > if the program decides a URL is a search log, it extracts session, query,
> > and time, corrects typos, determines an implicit rating, goes to the
> > Search table (this is a relational database table), and stores these in
> > the table. If the program decides a URL is a product view log, it
> > extracts session, member_id, product_id, time, product title, and rating
> > for the product, goes to the Product_View table, and stores it. After it
> > finishes storing, it extracts, for example, popular queries for
> > assisting search.
> >
> > If I want to do all of this with Pig:
> > - Should I partition the global log file into separate files (so that
> > search_logs and product_view_logs are in separate files)? or
> > - Can some Pig commands load the data and treat each tuple according to
> > its type (e.g. this is a search log and it should have
> > "session-query-time-implicit rating"), so that I can avoid partitioning
> > the data for each type of log?
> >
> > I have just downloaded Pig and it seems to be able to do such tasks. I
> > would appreciate it if anyone could show me a starting point for such an
> > application and share some ideas.
> > Thank you.
> > --
> > Gökhan Çapan
> > Dilişim
> >
>
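
For reference, the idea behind Option 1 in the quoted reply (one loader per log type, each discarding records of other types) might look roughly like the sketch below. This is plain Python for illustration only, not Pig's LoadFunc API. The tab-separated session/URL/time layout, the two URL patterns, and the function names are all assumptions that do not come from this thread.

```python
import re

# Hypothetical URL patterns; the thread does not say what marks each log type.
SEARCH_PATTERN = re.compile(r"/search\?q=([^&\s]+)")
PRODUCT_PATTERN = re.compile(r"/product/(\d+)")


def _parse(line):
    """Split one record into (session, url, time); assumes tab-separated fields."""
    parts = line.rstrip("\n").split("\t")
    return tuple(parts) if len(parts) == 3 else None


def load_search_logs(lines):
    """Yield (session, query, time) for search records and skip everything else."""
    for line in lines:
        record = _parse(line)
        if record is None:
            continue  # malformed line
        session, url, time = record
        match = SEARCH_PATTERN.search(url)
        if match:
            yield (session, match.group(1), time)


def load_product_view_logs(lines):
    """Yield (session, product_id, time) for product views and skip everything else."""
    for line in lines:
        record = _parse(line)
        if record is None:
            continue  # malformed line
        session, url, time = record
        match = PRODUCT_PATTERN.search(url)
        if match:
            yield (session, match.group(1), time)
```

Each loader reads the same mixed log but only emits its own type, so the file does not have to be partitioned beforehand. The cost is that the full log is scanned once per log type.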

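Option 3 in the quoted reply (merging every type's fields into one wide field set, then splitting by type) could be pictured with the small Python sketch below. The field names follow the (Log_Type, A_1, A_2, B_1, B_2) example from the reply; the helper names and the use of None for missing fields are assumptions, and in Pig itself the final step would be done by the split statement rather than by code like this.

```python
# Wide field set from the example in the reply.
WIDE_FIELDS = ("Log_Type", "A_1", "A_2", "B_1", "B_2")


def normalize(log_type, record):
    """Map a type-specific record (a dict) onto the wide field set.

    Fields that the given log type does not have are filled with None.
    """
    row = {name: record.get(name) for name in WIDE_FIELDS[1:]}
    row["Log_Type"] = log_type
    return tuple(row[name] for name in WIDE_FIELDS)


def split_by_type(rows):
    """Group normalized rows by their Log_Type column (the first field)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return groups


rows = [
    normalize("A", {"A_1": 1, "A_2": 2}),
    normalize("B", {"B_1": 3, "B_2": 4}),
]
groups = split_by_type(rows)
```

With this layout a single load step can read the unsplit log, at the price of rows that are mostly empty when the log types share few fields.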


-- 
Gökhan Çapan
Dilişim
