Option 4:

Your loader/parser, upon reading a line of logs, creates an
appropriate record with its type-specific fields, and emits
(type_specifier:int, data:tuple). Then split by the type specifier,
and apply a type-specific schema to each tuple after the split.
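
A rough sketch of what that could look like in Pig Latin (all names
here are hypothetical: LogLoader, the type codes, and the field
layouts; this assumes a custom LoadFunc that emits the
(type_specifier, data) pair):

```pig
-- Hypothetical custom LoadFunc that emits (type_specifier:int, data:tuple)
raw = LOAD 'logs' USING LogLoader() AS (log_type:int, data:tuple());

-- Split by the type specifier
SPLIT raw INTO searches IF log_type == 1, views IF log_type == 2;

-- Apply a type-specific schema to each branch (field positions assumed)
search_logs = FOREACH searches GENERATE
    (chararray) data.$0 AS session,
    (chararray) data.$1 AS query,
    (long)      data.$2 AS time;

view_logs = FOREACH views GENERATE
    (chararray) data.$0 AS session,
    (int)       data.$1 AS product_id,
    (long)      data.$2 AS time;
```

Both branches come out of a single scan of the input, so you avoid
re-reading the data once per log type.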

-Dmitriy

On Thu, Dec 24, 2009 at 3:26 AM, Gökhan Çapan <[email protected]> wrote:
> Yeah, I've got it now. That means scanning all the data for each type.
> Unfortunately, I think I will have to do it the 1st or 3rd way.
> Thank you again, Jeff.
>
> On Thu, Dec 24, 2009 at 12:00 PM, Jeff Zhang <[email protected]> wrote:
>
>> The first option means Log Type A's LoadFunc emits only Log Type A records,
>> and filters out the other log types. This method is not very efficient,
>> because it still has to scan all the data to extract a single log type.
>>
>>
>> Jeff Zhang
>>
>>
>>
>> On Thu, Dec 24, 2009 at 1:43 AM, Gökhan Çapan <[email protected]> wrote:
>>
>> > No, you didn't misunderstand, and thank you very much for this advice.
>> > But I couldn't understand what you meant in the 1st option.
>> >
>> > On Thu, Dec 24, 2009 at 11:36 AM, Jeff Zhang <[email protected]> wrote:
>> >
>> > > Hi Gökhan, I assume your log has one record per line, and it seems
>> > > that your logs have different types, with each type of log having
>> > > different fields. If you'd like to use Pig for your case, I think
>> > > you have several options:
>> > >
>> > > Option 1. You can create a different LoadFunc for each type of log,
>> > > and filter out the other types in the LoadFunc.
>> > >
>> > > Option 2. Split each type of log into a different file, then load
>> > > each file with its log type's respective LoadFunc.
>> > >
>> > > Option 3. Do not split your log files; instead, normalize your logs'
>> > > fields. Here normalization means merging the fields of all the log
>> > > types into one large field set. E.g., if you have two types of logs,
>> > > one with fields (A_1, A_2) and the other with fields (B_1, B_2), you
>> > > can merge them into one large field set: (Log_Type, A_1, A_2, B_1,
>> > > B_2). Then split the logs in your Pig script using the SPLIT
>> > > statement.
>> > >
>> > >
>> > > Which method to use depends on your requirements and situation. I
>> > > hope I did not misunderstand your meaning.
>> > >
>> > >
>> > > Jeff Zhang
>> > >
>> > >
>> > > On Thu, Dec 24, 2009 at 1:16 AM, Gökhan Çapan <[email protected]>
>> > wrote:
>> > >
>> > > > Hi, this was probably discussed before on this list, but I
>> > > > couldn't find it.
>> > > > We are implementing log analysis tools for some high-traffic web
>> > > > sites.
>> > > > From now on, we want to use Pig to implement such analysis tools.
>> > > >
>> > > > We have millions of logs from a web site in a session-URL-time
>> > > > format.
>> > > > These are not just search logs or just product views; they consist
>> > > > of different types of actions.
>> > > >
>> > > > For example, if a URL contains a specific pattern, we call it a
>> > > > search log, etc.
>> > > >
>> > > > Until now, I was using a factory method to instantiate the
>> > > > appropriate URLHandler, and after extracting some information from
>> > > > the URL, I was storing it in the appropriate database table. For
>> > > > example, if the program decides a URL is a search log, it extracts
>> > > > the session, query, and time, corrects typos, determines the
>> > > > implicit rating, and stores these in the Search table (a
>> > > > relational database table). If the program decides a URL is a
>> > > > product view log, it extracts the session, member_id, product_id,
>> > > > time, product title, and product rating, and stores them in the
>> > > > Product_View table. After storing is finished, it extracts, for
>> > > > example, popular queries for assisting search.
>> > > >
>> > > > If I want to do all of this with Pig:
>> > > > - Should I partition the global log file into separate files
>> > > > (search_logs and product_view_logs in separate files)? or
>> > > > - Can some Pig commands load the data and treat each tuple
>> > > > according to its type (e.g. this is a search log, so it should
>> > > > have "session-query-time-implicit rating"), so that I can get rid
>> > > > of partitioning the data for each log type?
>> > > >
>> > > > I have just downloaded Pig, and it seems it is able to do such
>> > > > tasks. I would appreciate it if anyone could show me a starting
>> > > > point for such an application, and share some ideas.
>> > > > Thank you.
>> > > > --
>> > > > Gökhan Çapan
>> > > > Dilişim
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Gökhan Çapan
>> > Dilişim
>> >
>>
>
>
>
> --
> Gökhan Çapan
> Dilişim
>
