That is exactly what I need. I was wondering about the flexibility of such user-defined implementations. Thanks, Dmitriy.

2009/12/24 Dmitriy Ryaboy <[email protected]>:

Right, I was thinking you would put all that logic into the LoadFunc.

So, in pseudo-code:

MyLoadFunc.getNext():
    string = fetch_next_string();
    id = determine_record_type(string);
    switch (id):
        case 1: tuple = process_type_1(string); break;
        case 2: tuple = process_type_2(string); break;
        etc.
    return new Tuple(id, tuple);

process_type_1 and process_type_2 can do completely different things.

Now, in Pig:

data = load '/logs/today' using MyLoadFunc as (id:int, dataTuple:tuple());
split data into type1 if id == 1, type2 if id == 2;
one_w_types = foreach type1 generate
    $0,
    (double)    $1.$0 as price,
    (int)       $1.$1 as product_id,
    (chararray) $1.$2 as description;

and so on.

-D

2009/12/24 Gökhan Çapan <[email protected]>:

In addition, the tuples seem the same. Each line has an "id-session-URL-time" format.
My problem is reading a row, deciding which type it is, and storing it in a different format.

For example, if I read a row and decide it is a search log, I want to extract 3 additional fields from the URL and store it this way.
If it is of a different type, I will extract 4-5 new fields from the URL and store it that way.

I mean, all input is "id-session-URL-time" (the names of the fields in the input lines are not distinguishing).

One output is "id-session-A1-A2-A3-time",
another output is "id-session-B1-B2-B3-B4-time",
and so on.

For example:
- read the data as id, session, URL, time fields;
- do some processing on the URL and make "URL" a container of different subfields;
- store them.
That may be a solution. Is that possible?

2009/12/24 Dmitriy Ryaboy <[email protected]>:

Option 4:

Your loader/parser, upon reading a line of logs, creates an appropriate record with its type-specific fields, and emits (type_specifier:int, data:tuple). Then split by the type specifier, and apply type-specific schemas to the tuple after the split.

-Dmitriy

On Thu, Dec 24, 2009 at 3:26 AM, Gökhan Çapan <[email protected]> wrote:

Yeah, I've got it now. That is scanning all the data for each type.
Unfortunately, I think I will have to do it the 1st or 3rd way.
Thank you again, Jeff.

On Thu, Dec 24, 2009 at 12:00 PM, Jeff Zhang <[email protected]> wrote:

The first option means that Log Type A's LoadFunc only emits Log Type A and filters out the other types of log. This method is not so efficient, because it has to scan all the data for one type of log.

Jeff Zhang

On Thu, Dec 24, 2009 at 1:43 AM, Gökhan Çapan <[email protected]> wrote:

No, you didn't misunderstand, and thank you very much for this advice. But I couldn't understand what you meant in the 1st option.

On Thu, Dec 24, 2009 at 11:36 AM, Jeff Zhang <[email protected]> wrote:

Hi Gökhan, I assume your log is one record per line, and it seems that your logs have different types and different types of log have different fields. If you'd like to use Pig for your case, I think you have several options:

Option 1. You can create a different LoadFunc for each type of your log, and filter out the other types in the LoadFunc if they are not the type you want.
Option 2. Split each type of log into different files, then load the logs using each log type's respective LoadFunc.

Option 3. Do not split your log files; instead, normalize your logs' fields. Here normalization means merging the fields of all the log types into one large field set. E.g., if you have two types of logs, one with fields (A_1, A_2) and the other with fields (B_1, B_2), you can merge them into a large field set: (Log_Type, A_1, A_2, B_1, B_2). Then split the logs in the Pig script using the split statement.

Which method to use depends on your requirements and situation. I hope I did not misunderstand your meaning.

Jeff Zhang

On Thu, Dec 24, 2009 at 1:16 AM, Gökhan Çapan <[email protected]> wrote:

Hi, this was probably discussed before on this list, but I couldn't find it.
We are implementing log analysis tools for some web sites that have high traffic, and from now on we want to use Pig to implement such analysis tools.

We have millions of logs of a web site in a session-URL-time format. This is not just search logs, or just product views; it consists of different types of actions.

For example, if a URL contains a specific pattern, we call it a search log, etc.

Until now, I was using a factory method to instantiate the appropriate URLHandler and, after extracting some information from the URL, I was storing that information in the appropriate database table. For example, if the program decides a URL is a search log, it extracts session, query, and time, corrects typos, determines an implicit rating, and stores these in the Search table (a relational database table). If the program decides a URL is a product view log, it extracts session, member_id, product_id, time, product title, and a rating for the product, and stores them in the Product_View table. After it finishes storing, it extracts, for example, popular queries for assisting search.

If I want to do all of this with Pig:
- Should I partition the global log file into separate files (search_logs and product_view_logs in separate files)? Or
- Can some Pig commands load the data and treat each tuple according to its type (e.g. "this is a search log, so it should have session-query-time-implicit rating"), so that I can get rid of partitioning the data for each type of log?

I have just downloaded Pig and it seems able to do such tasks. I would appreciate it if anyone could show me a starting point for such an application and share some ideas.
Thank you.

--
Gökhan Çapan
Dilişim
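
A minimal Java sketch of the record-typing logic described in Dmitriy's pseudo-code, to make the approach concrete. It deliberately leaves out the LoadFunc plumbing, since the exact LoadFunc method signatures differ between Pig releases; parseLine() is a helper you would call from your own LoadFunc's getNext(). The tab-delimited line layout, the URL patterns used to classify records, the type ids, and the extracted fields are all assumptions invented for this example.

import java.io.IOException;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class LogLineParser {

    // Hypothetical type ids; they only have to match the SPLIT conditions
    // in the Pig script.
    public static final int SEARCH_LOG = 1;
    public static final int PRODUCT_VIEW_LOG = 2;

    private static final TupleFactory tf = TupleFactory.getInstance();

    // Turns one "id-session-URL-time" line into (type_id:int, data:tuple).
    // Assumes tab-separated fields: id, session, URL, time.
    public static Tuple parseLine(String line) throws IOException {
        String[] parts = line.split("\t");
        String session = parts[1];
        String url = parts[2];
        String time = parts[3];

        int type = determineRecordType(url);
        Tuple data;
        switch (type) {
            case SEARCH_LOG:
                data = processSearchLog(session, url, time);
                break;
            case PRODUCT_VIEW_LOG:
                data = processProductView(session, url, time);
                break;
            default:
                data = tf.newTuple();   // unknown type: empty inner tuple
        }

        Tuple result = tf.newTuple(2);
        result.set(0, type);   // type specifier, used by SPLIT in the Pig script
        result.set(1, data);   // type-specific inner tuple
        return result;
    }

    // Hypothetical classification: search pages carry "?q=", product pages
    // contain "/product/".
    private static int determineRecordType(String url) {
        if (url.contains("?q=")) return SEARCH_LOG;
        if (url.contains("/product/")) return PRODUCT_VIEW_LOG;
        return 0;
    }

    // Search logs: emit (session, query, time).
    private static Tuple processSearchLog(String session, String url, String time)
            throws IOException {
        Tuple t = tf.newTuple(3);
        t.set(0, session);
        t.set(1, url.substring(url.indexOf("?q=") + 3));
        t.set(2, time);
        return t;
    }

    // Product views: emit (session, product_id, time).
    private static Tuple processProductView(String session, String url, String time)
            throws IOException {
        Tuple t = tf.newTuple(3);
        t.set(0, session);
        t.set(1, url.substring(url.lastIndexOf('/') + 1));
        t.set(2, time);
        return t;
    }
}

With a helper like this wired into getNext(), the rest is the Pig script from Dmitriy's reply: SPLIT the loaded relation on the id field, then, in a FOREACH per branch, cast the inner tuple's positional fields to that branch's concrete schema before storing each branch separately.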
