That is exactly what I need. I was wondering about the flexibility of such user-defined implementations. Thanks, Dmitriy.

2009/12/24 Dmitriy Ryaboy <[email protected]>:

Right, I was thinking you would put all that logic into the LoadFunc.

So, in pseudo-code:

MyLoadFunc.getNext():
    string = fetch_next_string();
    id = determine_record_type(string);
    switch (id):
        case 1: tuple = process_type_1(string); break;
        case 2: tuple = process_type_2(string); break;
        etc.
    return new Tuple(id, tuple);

process_type_1 and process_type_2 can do completely different things.

Now, in Pig:

data = load '/logs/today' using MyLoadFunc as (id:int, dataTuple:tuple());
split data into type1 if id == 1, type2 if id == 2;
one_w_types = foreach type1 generate
    $0,
    (double)    $1.$0 as price,
    (int)       $1.$1 as product_id,
    (chararray) $1.$2 as description;

and so on.

-D

2009/12/24 Gökhan Çapan <[email protected]>:

In addition, the tuples seem the same. Each line has an "id-session-URL-time" format.
My problem is reading a row, deciding which type it is, and storing it in a different format.

For example, if I read a row and decide it is a search log, I want to extract 3 additional fields from the URL and store it this way.
If it is of a different type, I will extract 4-5 new fields from the URL and store it that way.

I mean, all input is "id-session-URL-time" (the names of the fields in the input lines are not distinguishing).

One output is "id-session-A1-A2-A3-time",
another output is "id-session-B1-B2-B3-B4-time",
and so on.

For example:
- read the data as id, session, URL, time fields;
- do some processing on the URL and make "URL" a container of different subfields;
- store them.
That may be a solution. Is that possible?

2009/12/24 Dmitriy Ryaboy <[email protected]>:

Option 4:

Your loader/parser, upon reading a line of logs, creates an appropriate record with its type-specific fields, and emits (type_specifier:int, data:tuple). Then split by the type specifier, and apply type-specific schemas to the tuple after the split.

-Dmitriy

On Thu, Dec 24, 2009 at 3:26 AM, Gökhan Çapan <[email protected]> wrote:

Yeah, I've got it now. That is scanning all the data for each type.
Unfortunately, I think I will have to do it the 1st or 3rd way.
Thank you again, Jeff.

On Thu, Dec 24, 2009 at 12:00 PM, Jeff Zhang <[email protected]> wrote:

The first option means that Log Type A's LoadFunc only emits Log Type A and filters out the other types of log. This method is not so efficient, because it has to scan all the data for one type of log.

Jeff Zhang

On Thu, Dec 24, 2009 at 1:43 AM, Gökhan Çapan <[email protected]> wrote:

No, you didn't misunderstand, and thank you very much for this advice. But I couldn't understand what you meant in the 1st option.

On Thu, Dec 24, 2009 at 11:36 AM, Jeff Zhang <[email protected]> wrote:

Hi Gökhan, I assume your log is one record per line, and it seems that your logs have different types and different types of log have different fields. If you'd like to use Pig for your case, I think you have several options:

Option 1. You can create a different LoadFunc for each type of your log, and filter out the other types in the LoadFunc if they are not the type you want.
Option 2. Split each type of log into different files, then load the logs using each log type's respective LoadFunc.

Option 3. Do not split your log files; instead, normalize your logs' fields. Here normalization means merging the fields of all the log types into one large field set. E.g., if you have two types of logs, one with fields (A_1, A_2) and the other with fields (B_1, B_2), you can merge them into a large field set: (Log_Type, A_1, A_2, B_1, B_2). Then split the logs in the Pig script using the split statement.

Which method to use depends on your requirements and situation. I hope I did not misunderstand your meaning.

Jeff Zhang

On Thu, Dec 24, 2009 at 1:16 AM, Gökhan Çapan <[email protected]> wrote:

Hi, this was probably discussed before on this list, but I couldn't find it.
We are implementing log analysis tools for some web sites that have high traffic, and from now on we want to use Pig to implement such analysis tools.

We have millions of logs of a web site in a session-URL-time format. This is not just search logs, or just product views; it consists of different types of actions.

For example, if a URL contains a specific pattern, we call it a search log, etc.

Until now, I was using a factory method to instantiate the appropriate URLHandler and, after extracting some information from the URL, I was storing that information in the appropriate database table. For example, if the program decides a URL is a search log, it extracts session, query, and time, corrects typos, determines an implicit rating, and stores these in the Search table (a relational database table). If the program decides a URL is a product view log, it extracts session, member_id, product_id, time, product title, and a rating for the product, and stores them in the Product_View table. After it finishes storing, it extracts, for example, popular queries for assisting search.

If I want to do all of this with Pig:
- Should I partition the global log file into separate files (search_logs and product_view_logs in separate files)? Or
- Can some Pig commands load the data and treat each tuple according to its type (e.g. "this is a search log, so it should have session-query-time-implicit rating"), so that I can get rid of partitioning the data for each type of log?

I have just downloaded Pig and it seems able to do such tasks. I would appreciate it if anyone could show me a starting point for such an application and share some ideas.
Thank you.

--
Gökhan Çapan
Dilişim
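
A minimal Java sketch of the record-typing logic described in Dmitriy's pseudo-code, to make the approach concrete. It deliberately leaves out the LoadFunc plumbing, since the exact LoadFunc method signatures differ between Pig releases; parseLine() is a helper you would call from your own LoadFunc's getNext(). The tab-delimited line layout, the URL patterns used to classify records, the type ids, and the extracted fields are all assumptions invented for this example.

import java.io.IOException;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class LogLineParser {

    // Hypothetical type ids; they only have to match the SPLIT conditions
    // in the Pig script.
    public static final int SEARCH_LOG = 1;
    public static final int PRODUCT_VIEW_LOG = 2;

    private static final TupleFactory tf = TupleFactory.getInstance();

    // Turns one "id-session-URL-time" line into (type_id:int, data:tuple).
    // Assumes tab-separated fields: id, session, URL, time.
    public static Tuple parseLine(String line) throws IOException {
        String[] parts = line.split("\t");
        String session = parts[1];
        String url = parts[2];
        String time = parts[3];

        int type = determineRecordType(url);
        Tuple data;
        switch (type) {
            case SEARCH_LOG:
                data = processSearchLog(session, url, time);
                break;
            case PRODUCT_VIEW_LOG:
                data = processProductView(session, url, time);
                break;
            default:
                data = tf.newTuple();   // unknown type: empty inner tuple
        }

        Tuple result = tf.newTuple(2);
        result.set(0, type);   // type specifier, used by SPLIT in the Pig script
        result.set(1, data);   // type-specific inner tuple
        return result;
    }

    // Hypothetical classification: search pages carry "?q=", product pages
    // contain "/product/".
    private static int determineRecordType(String url) {
        if (url.contains("?q=")) return SEARCH_LOG;
        if (url.contains("/product/")) return PRODUCT_VIEW_LOG;
        return 0;
    }

    // Search logs: emit (session, query, time).
    private static Tuple processSearchLog(String session, String url, String time)
            throws IOException {
        Tuple t = tf.newTuple(3);
        t.set(0, session);
        t.set(1, url.substring(url.indexOf("?q=") + 3));
        t.set(2, time);
        return t;
    }

    // Product views: emit (session, product_id, time).
    private static Tuple processProductView(String session, String url, String time)
            throws IOException {
        Tuple t = tf.newTuple(3);
        t.set(0, session);
        t.set(1, url.substring(url.lastIndexOf('/') + 1));
        t.set(2, time);
        return t;
    }
}

With a helper like this wired into getNext(), the rest is the Pig script from Dmitriy's reply: SPLIT the loaded relation on the id field, then, in a FOREACH per branch, cast the inner tuple's positional fields to that branch's concrete schema before storing each branch separately.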
