On Tue, May 24, 2011 at 6:23 PM, Aleksandr Elbakyan <[email protected]> wrote:
> Hello,
>
> We currently have a complicated process which has more than 20 jobs piped
> to each other. We are using a shell script to control the flow; I saw some
> other company using Spring Batch. We use Pig, streaming and Hive.
>
> Note one thing: if you are using EC2 for your jobs, all local files need
> to be stored in /mnt. Currently our cluster is organized this way in HDFS:
> we process our data hourly and rotate the final result to the beginning of
> the pipeline for the next run. Each process's output is the next process's
> input, so we keep all data for the current execution in the same dated
> folder. If you run daily it will be e.g. 20111212, if hourly 201112121416,
> with a subfolder added for each subprocess. Example:
>
> /user/{domain}/{date}/input
> /user/{domain}/{date}/process1
> /user/{domain}/{date}/process2
> /user/{domain}/{date}/process3
> /user/{domain}/{date}/process4
>
> Our process1 takes as input the currently converted files plus the output
> from the last run. After we start the job we load the converted files into
> the input location and move them out of local space so we will not
> reprocess them.
>
> Not sure if there are examples; this will all depend on the architecture
> of the project you are doing. I bet if you put everything you need to do
> on a whiteboard you will find the best folder structure for yourself :)
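To make the chaining described above concrete, here is a minimal sketch in the Java MapReduce API of the era, where each stage's output directory under /user/{domain}/{date}/ becomes the next stage's input. The HourlyPipeline class, the identity mapper/reducer placeholders, and the argument handling are all assumptions for illustration, not the poster's actual code:

// Sketch only: chain two jobs so the first job's output directory
// becomes the second job's input, following the dated layout above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlyPipeline {

  // Run one pipeline stage that reads `in` and writes `out`.
  static boolean runStage(Configuration conf, String name, Path in, Path out)
      throws Exception {
    Job job = new Job(conf, name);
    job.setJarByClass(HourlyPipeline.class);
    job.setMapperClass(Mapper.class);    // identity; real stage logic goes here
    job.setReducerClass(Reducer.class);  // identity; real stage logic goes here
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    return job.waitForCompletion(true);  // block so stages run one after another
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String base = args[0];  // e.g. /user/mydomain/201112121416 for an hourly run

    // process1 reads the freshly loaded input; process2 reads process1's output.
    if (!runStage(conf, "process1", new Path(base + "/input"),
                  new Path(base + "/process1"))) System.exit(1);
    if (!runStage(conf, "process2", new Path(base + "/process1"),
                  new Path(base + "/process2"))) System.exit(1);
  }
}

Driving this from a shell script, as described above, amounts to the same thing: each hadoop jar invocation names the previous stage's directory as its input.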
Thanks for the info! Is there a way to point me to an example where the
output of one process is then piped to another? For example: the results of
day 1 are stored in the day1/output directory; how do I then feed these
results to day 2's job? I am assuming these results will go directly to the
reducer since they have already gone through a map-reduce cycle. Just looking
for some example to get a little more feel for how to do it. So far I have
installed Hadoop and gone through the basic word count tutorial, but I still
lack knowledge of some important features.

> Regards,
> Aleksandr
>
> --- On Tue, 5/24/11, Mohit Anchlia <[email protected]> wrote:
>
> From: Mohit Anchlia <[email protected]>
> Subject: Re: Processing xml files
> To: [email protected]
> Date: Tuesday, May 24, 2011, 5:20 PM
>
> Thanks, some more questions :)
>
> On Tue, May 24, 2011 at 4:54 PM, Aleksandr Elbakyan <[email protected]> wrote:
>> Can you please give more info?
>>>> We currently have an off-Hadoop process which uses a Java XML parser to
>>>> convert it to flat files. We have files from a couple of KB to tens of GB.
>
> Do you convert it into a flat file and write it to HDFS? Do you write
> all the files to the same directory in DFS, or do you group directories
> based on days, for example? So 2011/01/01 contains 10 files: store the
> results of those 10 files somewhere, and then on 2011/02/02 store another,
> say, 20 files. Now analyze the 20 files and use the results from the 10
> files to do the aggregation. If so, how do you do it? Or how should I do
> it, since reprocessing those files again would be overhead?
>
> Please point me to examples so that you don't have to teach me Hadoop
> or Pig processing :)
>
>> Do you append multiple xml files' data as lines in one file? Or
>> some other way? If so, how big do you let the files get?
>>
>> We currently feed our process a folder with the converted files. We don't
>> size them in any way; we let Hadoop handle it.
>
> Didn't think about it. I was just thinking in terms of using big
> files. So when using small files, Hadoop will automatically distribute
> the files across the cluster, I am assuming based on some hashing.
>
>> How do you create these files, assuming your xml is stored somewhere
>> else in a DB or filesystem? Read them one by one?
>>
>> What are your experiences using text files instead of xml?
>> If you are using a streaming job, it is easier to build your logic if you
>> have one file. You can actually try to parse the xml in your mapper and
>> convert it for the reducer, but why don't you just write a small app which
>> will convert it?
>
>> Is there a reason why xml files can't or shouldn't be used directly in
>> Hadoop? Any performance implications?
>> If you are using Pig there is an XML reader:
>> http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html
>
> Which one is better? Converting files to flat files or using xml as
> is? How do I make that decision?
>
>> If you have a well-defined schema it is easier to work with big data :)
>>
>> Any readings suggested in this area?
>> Try looking into Pig; it has lots of useful stuff which will make your
>> experience with Hadoop nicer.
>
> I will download the Pig tutorial and see how that works. Are there any
> other xml-related examples you can point me to?
>
> Thanks a lot!
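On the day1/day2 question above, one point worth correcting is the assumption that previous results go directly to the reducer: a job's input always passes through the map phase again; plain MapReduce has no way to feed data straight into reducers. The usual approach is simply to add both yesterday's output directory and today's new data as input paths. A minimal sketch, with the paths and the DailyAggregate class invented for illustration:

// Sketch: day 2's job takes both day 1's results and day 2's new data as
// input. Both directories pass through the map phase; previous output does
// not bypass the mappers.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyAggregate {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "day2-aggregate");
    job.setJarByClass(DailyAggregate.class);
    job.setMapperClass(Mapper.class);    // identity placeholder
    job.setReducerClass(Reducer.class);  // identity placeholder
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Yesterday's aggregated results and today's raw converted files are
    // both mapped; the mapper can tell them apart by record format if needed.
    FileInputFormat.addInputPath(job, new Path("/user/mydomain/day1/output"));
    FileInputFormat.addInputPath(job, new Path("/user/mydomain/day2/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/mydomain/day2/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If the two directories hold differently formatted records, org.apache.hadoop.mapreduce.lib.input.MultipleInputs can bind a separate mapper class to each input path.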
>> Our xml is something like:
>>
>> <column id="Name" security="sensitive" xsi:type="Text">
>> <value>free a last</value>
>> </column>
>> <column id="age" security="no" xsi:type="Text">
>> <value>40</value>
>> </column>
>>
>> And we would, for example, want to know how many customers are above a
>> certain age, or of a certain age with a certain income, etc.
>>
>> Hadoop has built-in counters; did you look into the word count example
>> from Hadoop?
>>
>> Regards,
>> Aleksandr
>>
>> --- On Tue, 5/24/11, Mohit Anchlia <[email protected]> wrote:
>>
>> From: Mohit Anchlia <[email protected]>
>> Subject: Re: Processing xml files
>> To: [email protected]
>> Date: Tuesday, May 24, 2011, 4:41 PM
>>
>> On Tue, May 24, 2011 at 4:25 PM, Aleksandr Elbakyan <[email protected]> wrote:
>>> Hello,
>>>
>>> We have the same type of data; we currently convert it to a tab-delimited
>>> file and use it as input for streaming.
>>
>> Can you please give more info?
>> Do you append multiple xml files' data as lines in one file? Or
>> some other way? If so, how big do you let the files get?
>> How do you create these files, assuming your xml is stored somewhere
>> else in a DB or filesystem? Read them one by one?
>> What are your experiences using text files instead of xml?
>> Is there a reason why xml files can't or shouldn't be used directly in
>> Hadoop? Any performance implications?
>> Any readings suggested in this area?
>>
>> Our xml is something like:
>>
>> <column id="Name" security="sensitive" xsi:type="Text">
>> <value>free a last</value>
>> </column>
>> <column id="age" security="no" xsi:type="Text">
>> <value>40</value>
>> </column>
>>
>> And we would, for example, want to know how many customers are above a
>> certain age, or of a certain age with a certain income, etc.
>>
>> Sorry for all the questions. I am new and trying to get a grasp, and
>> also to learn how I would actually solve our use case.
>>
>>> Regards,
>>> Aleksandr
>>>
>>> --- On Tue, 5/24/11, Mohit Anchlia <[email protected]> wrote:
>>>
>>> From: Mohit Anchlia <[email protected]>
>>> Subject: Processing xml files
>>> To: [email protected]
>>> Date: Tuesday, May 24, 2011, 4:16 PM
>>>
>>> I just started learning Hadoop and got done with the wordcount mapreduce
>>> example. I also briefly looked at Hadoop streaming.
>>>
>>> Some questions:
>>> 1) What should be my first step now? Are there more examples
>>> somewhere that I can try out?
>>> 2) My second question is around practical usability with xml files. Our
>>> xml files are not big, around 120 KB in size, but Hadoop is really meant
>>> for big files, so how do I go about processing these xml files?
>>> 3) Are there any samples or advice on how to process xml files?
>>>
>>> Looking for help and pointers.
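Picking up the suggestion in the thread to "write a small app which will convert it": here is a hedged sketch of an off-Hadoop converter that turns the <column>/<value> sample into one tab-delimited line per record, ready to feed to a streaming job. It assumes each file wraps its columns in a single root element, which the fragments quoted above do not show, so treat the structure as illustrative only:

// Sketch of a small off-Hadoop converter: reads XML records like the
// sample in the thread and prints one tab-delimited line per file.
// Assumes one root element per file wrapping the <column> elements.
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlToTsv {
  public static void main(String[] args) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    for (String fileName : args) {
      Document doc = builder.parse(new File(fileName));
      NodeList columns = doc.getElementsByTagName("column");
      StringBuilder line = new StringBuilder();
      for (int i = 0; i < columns.getLength(); i++) {
        Element column = (Element) columns.item(i);
        String value = column.getElementsByTagName("value")
                             .item(0).getTextContent().trim();
        if (line.length() > 0) line.append('\t');
        // Emit id<TAB>value pairs so the column names travel with the data.
        line.append(column.getAttribute("id")).append('\t').append(value);
      }
      System.out.println(line);
    }
  }
}

The resulting flat file would then be copied into the dated input directory with hadoop fs -put before the pipeline run.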
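And to connect the pointer about built-in counters to the "customers above a certain age" question: once the data is flat, a map-only job can do the count with a custom counter and no reducer at all. A sketch assuming the id<TAB>value line layout produced by the converter above and a threshold of 40; the class and counter names are made up:

// Sketch: count customers above an age threshold using a Hadoop counter.
// Assumes tab-delimited lines of alternating column ids and values, e.g.
// "Name<TAB>free a last<TAB>age<TAB>40".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AgeCount {
  public enum Stats { CUSTOMERS_OVER_40 }

  public static class AgeMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) {
      String[] fields = value.toString().split("\t");
      for (int i = 0; i + 1 < fields.length; i += 2) {
        // Guard against non-numeric values before parsing.
        if ("age".equals(fields[i]) && fields[i + 1].matches("\\d+")
            && Integer.parseInt(fields[i + 1]) > 40) {
          context.getCounter(Stats.CUSTOMERS_OVER_40).increment(1);
        }
      }
      // Nothing is emitted; the counter itself is the result.
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "age-count");
    job.setJarByClass(AgeCount.class);
    job.setMapperClass(AgeMapper.class);
    job.setNumReduceTasks(0);  // map-only: counters need no reduce phase
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
    long n = job.getCounters().findCounter(Stats.CUSTOMERS_OVER_40).getValue();
    System.out.println("customers over 40: " + n);
  }
}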
