If this is a one-time operation in your pipeline and you are OK with splitting the data, you might want to consider using Hadoop directly and splitting based on a multiple-output collector.
It can be a map-only job with a line record reader or similar: a map function that does the split the way you were doing it in your existing DB code, writing to the appropriate output collector based on the type.
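A minimal sketch of that idea, assuming the newer org.apache.hadoop.mapreduce API and its MultipleOutputs helper; the URL patterns, the tab-separated session/URL/time layout, and the output names below are assumptions to be replaced with the real rules:

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LogSplitter {

    public static class SplitMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        // Hypothetical URL rules; replace with the real patterns.
        private static final Pattern SEARCH = Pattern.compile(".*[?&]q=.*");
        private static final Pattern PRODUCT = Pattern.compile(".*/product/.*");

        private MultipleOutputs<NullWritable, Text> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The default TextInputFormat hands the mapper one log line at a time.
            // Assumes tab-separated "session \t url \t time" records.
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            String url = fields[1];
            if (SEARCH.matcher(url).matches()) {
                out.write("search", NullWritable.get(), value, "search/part");
            } else if (PRODUCT.matcher(url).matches()) {
                out.write("productview", NullWritable.get(), value, "productview/part");
            }
            // unmatched lines are simply dropped
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            out.close(); // flush all per-type files
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log splitter");
        job.setJarByClass(LogSplitter.class);
        job.setMapperClass(SplitMapper.class);
        job.setNumReduceTasks(0); // map-only: mappers write the named outputs directly
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // Named output names must be alphanumeric.
        MultipleOutputs.addNamedOutput(job, "search",
                TextOutputFormat.class, NullWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "productview",
                TextOutputFormat.class, NullWritable.class, Text.class);
        // Avoid creating empty part files for the unused default output.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each type then lands under its own subdirectory of the job output (search/, productview/), so each type can be loaded into Pig with its own fixed schema.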
All further analysis can then be done through Pig, which would work on a more type-specific, schema-aware form (assuming each type has a fixed schema, while the initial jumble of types does not have a uniform schema).
Not sure if it is practical, since I have not used this for map-only jobs ...

Regards,
Mridul

Gökhan Çapan wrote:
Hi,

This was probably discussed on this list before, but I couldn't find it. We are implementing log analysis tools for some web sites that have high traffic, and from now on we want to use Pig to implement such tools. We have millions of logs for a web site in a session-URL-time format. These are not just search logs or just product views; they consist of different types of actions. For example, if a URL contains a specific pattern, we call it a search log, and so on.

Until now, I was using a factory method to instantiate the appropriate URLHandler, and after extracting some information from the URL, I stored that information in the appropriate database table. For example, if the program decides a URL is a search log, it extracts session, query, and time, corrects typos, determines an implicit rating, and stores all of these in the Search table (a relational database table). If the program decides a URL is a product view log, it extracts session, member_id, product_id, time, product title, and a rating for the product, and stores them in the Product_View table. After the storing is finished, it extracts, for example, popular queries for assisting search.

If I want to do all of this with Pig:

- Should I partition the global log file into separate files (so that search_logs and product_view_logs are in separate files)? Or
- Can some Pig commands load the data and treat each tuple according to its type (e.g., "this is a search log, so it should have session-query-time-implicit rating"), so that I can get rid of partitioning the data for each type of log?

I have just downloaded Pig, and it seems it is able to do such tasks. I would appreciate it if anyone could show me a starting point for such an application and share some ideas.

Thank you.
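For readers following the thread: the factory dispatch described in the question might look roughly like the Java sketch below; all names and the search pattern are hypothetical. Mridul's suggestion above amounts to reusing this same classification inside a map function, writing to a per-type output instead of a database table.

import java.util.regex.Pattern;

interface URLHandler {
    // Extract the type-specific fields from the log entry and store them.
    void handle(String session, String url, long time);
}

class SearchHandler implements URLHandler {
    public void handle(String session, String url, long time) {
        // extract the query, correct typos, determine the implicit rating,
        // then insert into the Search table
    }
}

class ProductViewHandler implements URLHandler {
    public void handle(String session, String url, long time) {
        // extract member_id, product_id, product title, and rating,
        // then insert into the Product_View table
    }
}

class URLHandlerFactory {
    // Hypothetical pattern standing in for "URL contains a specific pattern".
    private static final Pattern SEARCH = Pattern.compile(".*[?&]q=.*");

    static URLHandler forUrl(String url) {
        return SEARCH.matcher(url).matches()
                ? new SearchHandler()
                : new ProductViewHandler();
    }
}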
