Option 4: Your loader/parser, upon reading a log line, creates an appropriate record with its type-specific fields and emits (type_specifier:int, data:tuple). Then SPLIT by the type specifier, and apply the type-specific schema to the tuple in each branch after the split.
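In rough Pig Latin that could look something like the sketch below; MyLogLoader, the numeric type codes, and the per-type field names are placeholders I'm making up for illustration, not anything that exists:

-- Sketch of Option 4. The loader emits (log_type:int, data:tuple);
-- the inner tuple gets generic field names so it can hold any log type.
-- (You'd REGISTER the jar containing your MyLogLoader first.)
raw = LOAD '/logs/raw' USING MyLogLoader()
      AS (log_type:int,
          data:(f1:chararray, f2:chararray, f3:chararray, f4:chararray));

-- one pass over the data, split by the type specifier
SPLIT raw INTO search_raw IF log_type == 1,
               product_view_raw IF log_type == 2;

-- apply the type-specific field names to the tuple after the split
searches = FOREACH search_raw GENERATE
               data.f1 AS session, data.f2 AS query,
               data.f3 AS time,    data.f4 AS rating;
product_views = FOREACH product_view_raw GENERATE
               data.f1 AS session,    data.f2 AS member_id,
               data.f3 AS product_id, data.f4 AS time;

The nice part is that the data is scanned only once, and each branch still gets meaningful, type-specific field names afterwards.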
-Dmitriy

On Thu, Dec 24, 2009 at 3:26 AM, Gökhan Çapan <[email protected]> wrote:
> Yeah, I've got it now. That is scanning all the data for each type.
> Unfortunately, I think I will have to do it the 1st or 3rd way.
> Thank you again, Jeff.
>
> On Thu, Dec 24, 2009 at 12:00 PM, Jeff Zhang <[email protected]> wrote:
>
>> The first option means Log Type A's LoadFunc only emits Log Type A and
>> filters out the other types of log. This method is not so efficient,
>> because it has to scan all the data for one log type.
>>
>> Jeff Zhang
>>
>> On Thu, Dec 24, 2009 at 1:43 AM, Gökhan Çapan <[email protected]> wrote:
>>
>> > No, you didn't misunderstand, and thank you very much for the advice.
>> > But I couldn't understand what you meant in the 1st option.
>> >
>> > On Thu, Dec 24, 2009 at 11:36 AM, Jeff Zhang <[email protected]> wrote:
>> >
>> > > Hi Gökhan, I assume your log has one record per line, and it seems
>> > > that your logs have different types and each type of log has
>> > > different fields. If you'd like to use Pig for your case, I think
>> > > you have several options:
>> > >
>> > > Option 1. You can create a different LoadFunc for each type of log,
>> > > filtering out the other types in the LoadFunc if they are not the
>> > > type you want.
>> > >
>> > > Option 2. Split each type of log into different files, then load the
>> > > logs using each log type's respective LoadFunc.
>> > >
>> > > Option 3. Do not split your log files; instead, normalize your logs'
>> > > fields. Here normalization means merging the fields of all the log
>> > > types into one large field set. E.g. if you have two types of logs,
>> > > one with fields (A_1, A_2) and the other with fields (B_1, B_2), you
>> > > can merge them into a large field set: (Log_Type, A_1, A_2, B_1,
>> > > B_2). Then split the logs in the Pig script using the SPLIT
>> > > statement.
>> > >
>> > > Which method to use depends on your requirements and situation; I
>> > > hope I did not misunderstand your meaning.
>> > >
>> > > Jeff Zhang
>> > >
>> > > On Thu, Dec 24, 2009 at 1:16 AM, Gökhan Çapan <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi, this was probably discussed before on this list, but I
>> > > > couldn't find it. We are implementing log analysis tools for some
>> > > > web sites that have high traffic, and from now on we want to use
>> > > > Pig to implement such analysis tools.
>> > > >
>> > > > We have millions of logs of a web site in a session-URL-time
>> > > > format. These are not just search logs or just product views;
>> > > > they consist of different types of actions. For example, if a URL
>> > > > contains a specific pattern, we call it a search log, etc.
>> > > >
>> > > > Until now, I was using a factory method to instantiate the
>> > > > appropriate URLHandler and, after extracting some information from
>> > > > the URL, storing that information in the appropriate database
>> > > > table. For example, if the program decides a URL is a search log,
>> > > > it extracts session, query, and time, corrects typos, determines
>> > > > an implicit rating, goes to the Search table (a relational
>> > > > database table), and stores these in the table. If the program
>> > > > decides a URL is a product view log, it extracts session,
>> > > > member_id, product_id, time, product title, and the rating for the
>> > > > product, goes to the Product_View table, and stores them. After
>> > > > storing everything, it extracts, for example, popular queries for
>> > > > assisting search.
>> > > >
>> > > > If I want to do all of this with Pig:
>> > > > - Should I partition the global log file into separate files
>> > > > (search_logs and product_view_logs in separate files)? Or
>> > > > - Can some Pig commands load the data and treat each tuple
>> > > > according to its type (e.g. this is a search log and it should
>> > > > have "session-query-time-implicit rating"), so that I can avoid
>> > > > partitioning the data for each type of log?
>> > > >
>> > > > I have just downloaded Pig, and it seems able to do such tasks. I
>> > > > would appreciate it if anyone could show me a starting point for
>> > > > such an application and share some ideas.
>> > > > Thank you.
>> > > > --
>> > > > Gökhan Çapan
>> > > > Dilişim
>> > > >
>> > >
>> >
>> > --
>> > Gökhan Çapan
>> > Dilişim
>> >
>>
>
> --
> Gökhan Çapan
> Dilişim
>
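P.S. If you end up going with Jeff's Option 3 (the normalized field set plus SPLIT), a minimal sketch might look like this; the field names, the tab-delimited input, and the 'search'/'product_view' type values are assumptions on my part, not anything from your actual logs:

-- Option 3 sketch: one wide schema shared by all log types; fields that
-- don't apply to a given type are simply left empty.
logs = LOAD '/logs/normalized' USING PigStorage('\t')
       AS (log_type:chararray, session:chararray, query:chararray,
           member_id:long, product_id:long, time:long, rating:double);

SPLIT logs INTO searches IF log_type == 'search',
                product_views IF log_type == 'product_view';

-- e.g. popular queries, computed from the search branch only
grouped = GROUP searches BY query;
query_counts = FOREACH grouped GENERATE group AS query, COUNT(searches) AS cnt;
popular_queries = ORDER query_counts BY cnt DESC;

Each branch then behaves like its own relation, so the product-view analysis never has to touch the search fields, and vice versa.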
