On Tue, May 24, 2011 at 6:23 PM, Aleksandr Elbakyan <[email protected]> wrote:
> Hello,
>
> We currently have a complicated process which has more than 20 jobs piped
> to each other. We are using a shell script to control the flow; I saw some
> other company using Spring Batch. We use Pig, streaming and Hive.
>
> Note one thing: if you are using EC2 for your jobs, all local files need
> to be stored in /mnt. Currently our cluster is organized this way in HDFS:
> we process our data hourly and rotate the final result to the beginning of
> the pipeline for the next run. Each process's output is the next process's
> input, so we keep all data for the current execution in the same dated
> folder. If you run daily it will be e.g. 20111212, if hourly 201112121416,
> with a subfolder added for each subprocess. Example:
>
> /user/{domain}/{date}/input
> /user/{domain}/{date}/process1
> /user/{domain}/{date}/process2
> /user/{domain}/{date}/process3
> /user/{domain}/{date}/process4
>
> Our process1 takes as input the currently converted files plus the output
> from the last run. After we start the job we load the converted files into
> the input location and move them out of local space so we will not
> reprocess them.
>
> Not sure if there are examples; this will all depend on the architecture
> of the project you are doing. I bet if you put everything you need to do
> on a whiteboard you will find the best folder structure for yourself :)
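To make the chaining described above concrete, here is a minimal sketch in the Java MapReduce API of the era, where each stage's output directory under /user/{domain}/{date}/ becomes the next stage's input. The HourlyPipeline class, the identity mapper/reducer placeholders, and the argument handling are all assumptions for illustration, not the poster's actual code:

// Sketch only: chain two jobs so the first job's output directory
// becomes the second job's input, following the dated layout above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlyPipeline {

  // Run one pipeline stage that reads `in` and writes `out`.
  static boolean runStage(Configuration conf, String name, Path in, Path out)
      throws Exception {
    Job job = new Job(conf, name);
    job.setJarByClass(HourlyPipeline.class);
    job.setMapperClass(Mapper.class);    // identity; real stage logic goes here
    job.setReducerClass(Reducer.class);  // identity; real stage logic goes here
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    return job.waitForCompletion(true);  // block so stages run one after another
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String base = args[0];  // e.g. /user/mydomain/201112121416 for an hourly run

    // process1 reads the freshly loaded input; process2 reads process1's output.
    if (!runStage(conf, "process1", new Path(base + "/input"),
                  new Path(base + "/process1"))) System.exit(1);
    if (!runStage(conf, "process2", new Path(base + "/process1"),
                  new Path(base + "/process2"))) System.exit(1);
  }
}

Driving this from a shell script, as described above, amounts to the same thing: each hadoop jar invocation names the previous stage's directory as its input.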
Thanks for the info! Is there a way to point me to an example where the
output of one process is then piped to another? For example: the results of
day 1 are stored in the day1/output directory; how do I then feed these
results to day 2's job? I am assuming these results will go directly to the
reducer since they have already gone through a map-reduce cycle. Just looking
for some example to get a little more feel for how to do it. So far I have
installed Hadoop and gone through the basic word count tutorial, but I still
lack knowledge of some important features.

> Regards,
> Aleksandr
>
> --- On Tue, 5/24/11, Mohit Anchlia <[email protected]> wrote:
>
> From: Mohit Anchlia <[email protected]>
> Subject: Re: Processing xml files
> To: [email protected]
> Date: Tuesday, May 24, 2011, 5:20 PM
>
> Thanks, some more questions :)
>
> On Tue, May 24, 2011 at 4:54 PM, Aleksandr Elbakyan <[email protected]> wrote:
>> Can you please give more info?
>>>> We currently have an off-Hadoop process which uses a Java XML parser to
>>>> convert it to flat files. We have files from a couple of KB to tens of GB.
>
> Do you convert it into a flat file and write it to HDFS? Do you write
> all the files to the same directory in DFS, or do you group directories
> based on days, for example? So 2011/01/01 contains 10 files: store the
> results of those 10 files somewhere, and then on 2011/02/02 store another,
> say, 20 files. Now analyze the 20 files and use the results from the 10
> files to do the aggregation. If so, how do you do it? Or how should I do
> it, since reprocessing those files again would be overhead?
>
> Please point me to examples so that you don't have to teach me Hadoop
> or Pig processing :)
>
>> Do you append multiple xml files' data as lines in one file? Or
>> some other way? If so, how big do you let the files get?
>>
>> We currently feed our process a folder with the converted files. We don't
>> size them in any way; we let Hadoop handle it.
>
> Didn't think about it. I was just thinking in terms of using big
> files. So when using small files, Hadoop will automatically distribute
> the files across the cluster, I am assuming based on some hashing.
>
>> How do you create these files, assuming your xml is stored somewhere
>> else in a DB or filesystem? Read them one by one?
>>
>> What are your experiences using text files instead of xml?
>> If you are using a streaming job, it is easier to build your logic if you
>> have one file. You can actually try to parse the xml in your mapper and
>> convert it for the reducer, but why don't you just write a small app which
>> will convert it?
>
>> Is there a reason why xml files can't or shouldn't be used directly in
>> Hadoop? Any performance implications?
>> If you are using Pig there is an XML reader:
>> http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html
>
> Which one is better? Converting files to flat files or using xml as
> is? How do I make that decision?
>
>> If you have a well-defined schema it is easier to work with big data :)
>>
>> Any readings suggested in this area?
>> Try looking into Pig; it has lots of useful stuff which will make your
>> experience with Hadoop nicer.
>
> I will download the Pig tutorial and see how that works. Are there any
> other xml-related examples you can point me to?
>
> Thanks a lot!
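On the day1/day2 question above, one point worth correcting is the assumption that previous results go directly to the reducer: a job's input always passes through the map phase again; plain MapReduce has no way to feed data straight into reducers. The usual approach is simply to add both yesterday's output directory and today's new data as input paths. A minimal sketch, with the paths and the DailyAggregate class invented for illustration:

// Sketch: day 2's job takes both day 1's results and day 2's new data as
// input. Both directories pass through the map phase; previous output does
// not bypass the mappers.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyAggregate {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "day2-aggregate");
    job.setJarByClass(DailyAggregate.class);
    job.setMapperClass(Mapper.class);    // identity placeholder
    job.setReducerClass(Reducer.class);  // identity placeholder
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Yesterday's aggregated results and today's raw converted files are
    // both mapped; the mapper can tell them apart by record format if needed.
    FileInputFormat.addInputPath(job, new Path("/user/mydomain/day1/output"));
    FileInputFormat.addInputPath(job, new Path("/user/mydomain/day2/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/mydomain/day2/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If the two directories hold differently formatted records, org.apache.hadoop.mapreduce.lib.input.MultipleInputs can bind a separate mapper class to each input path.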
>> Our xml is something like:
>>
>> <column id="Name" security="sensitive" xsi:type="Text">
>> <value>free a last</value>
>> </column>
>> <column id="age" security="no" xsi:type="Text">
>> <value>40</value>
>> </column>
>>
>> And we would, for example, want to know how many customers are above a
>> certain age, or of a certain age with a certain income, etc.
>>
>> Hadoop has built-in counters; did you look into the word count example
>> from Hadoop?
>>
>> Regards,
>> Aleksandr
>>
>> --- On Tue, 5/24/11, Mohit Anchlia <[email protected]> wrote:
>>
>> From: Mohit Anchlia <[email protected]>
>> Subject: Re: Processing xml files
>> To: [email protected]
>> Date: Tuesday, May 24, 2011, 4:41 PM
>>
>> On Tue, May 24, 2011 at 4:25 PM, Aleksandr Elbakyan <[email protected]> wrote:
>>> Hello,
>>>
>>> We have the same type of data; we currently convert it to a tab-delimited
>>> file and use it as input for streaming.
>>
>> Can you please give more info?
>> Do you append multiple xml files' data as lines in one file? Or
>> some other way? If so, how big do you let the files get?
>> How do you create these files, assuming your xml is stored somewhere
>> else in a DB or filesystem? Read them one by one?
>> What are your experiences using text files instead of xml?
>> Is there a reason why xml files can't or shouldn't be used directly in
>> Hadoop? Any performance implications?
>> Any readings suggested in this area?
>>
>> Our xml is something like:
>>
>> <column id="Name" security="sensitive" xsi:type="Text">
>> <value>free a last</value>
>> </column>
>> <column id="age" security="no" xsi:type="Text">
>> <value>40</value>
>> </column>
>>
>> And we would, for example, want to know how many customers are above a
>> certain age, or of a certain age with a certain income, etc.
>>
>> Sorry for all the questions. I am new and trying to get a grasp, and
>> also to learn how I would actually solve our use case.
>>
>>> Regards,
>>> Aleksandr
>>>
>>> --- On Tue, 5/24/11, Mohit Anchlia <[email protected]> wrote:
>>>
>>> From: Mohit Anchlia <[email protected]>
>>> Subject: Processing xml files
>>> To: [email protected]
>>> Date: Tuesday, May 24, 2011, 4:16 PM
>>>
>>> I just started learning Hadoop and got done with the wordcount mapreduce
>>> example. I also briefly looked at Hadoop streaming.
>>>
>>> Some questions:
>>> 1) What should be my first step now? Are there more examples
>>> somewhere that I can try out?
>>> 2) My second question is around practical usability with xml files. Our
>>> xml files are not big, around 120 KB in size, but Hadoop is really meant
>>> for big files, so how do I go about processing these xml files?
>>> 3) Are there any samples or advice on how to process xml files?
>>>
>>> Looking for help and pointers.
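Picking up the suggestion in the thread to "write a small app which will convert it": here is a hedged sketch of an off-Hadoop converter that turns the <column>/<value> sample into one tab-delimited line per record, ready to feed to a streaming job. It assumes each file wraps its columns in a single root element, which the fragments quoted above do not show, so treat the structure as illustrative only:

// Sketch of a small off-Hadoop converter: reads XML records like the
// sample in the thread and prints one tab-delimited line per file.
// Assumes one root element per file wrapping the <column> elements.
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlToTsv {
  public static void main(String[] args) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    for (String fileName : args) {
      Document doc = builder.parse(new File(fileName));
      NodeList columns = doc.getElementsByTagName("column");
      StringBuilder line = new StringBuilder();
      for (int i = 0; i < columns.getLength(); i++) {
        Element column = (Element) columns.item(i);
        String value = column.getElementsByTagName("value")
                             .item(0).getTextContent().trim();
        if (line.length() > 0) line.append('\t');
        // Emit id<TAB>value pairs so the column names travel with the data.
        line.append(column.getAttribute("id")).append('\t').append(value);
      }
      System.out.println(line);
    }
  }
}

The resulting flat file would then be copied into the dated input directory with hadoop fs -put before the pipeline run.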
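And to connect the pointer about built-in counters to the "customers above a certain age" question: once the data is flat, a map-only job can do the count with a custom counter and no reducer at all. A sketch assuming the id<TAB>value line layout produced by the converter above and a threshold of 40; the class and counter names are made up:

// Sketch: count customers above an age threshold using a Hadoop counter.
// Assumes tab-delimited lines of alternating column ids and values, e.g.
// "Name<TAB>free a last<TAB>age<TAB>40".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AgeCount {
  public enum Stats { CUSTOMERS_OVER_40 }

  public static class AgeMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) {
      String[] fields = value.toString().split("\t");
      for (int i = 0; i + 1 < fields.length; i += 2) {
        // Guard against non-numeric values before parsing.
        if ("age".equals(fields[i]) && fields[i + 1].matches("\\d+")
            && Integer.parseInt(fields[i + 1]) > 40) {
          context.getCounter(Stats.CUSTOMERS_OVER_40).increment(1);
        }
      }
      // Nothing is emitted; the counter itself is the result.
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "age-count");
    job.setJarByClass(AgeCount.class);
    job.setMapperClass(AgeMapper.class);
    job.setNumReduceTasks(0);  // map-only: counters need no reduce phase
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
    long n = job.getCounters().findCounter(Stats.CUSTOMERS_OVER_40).getValue();
    System.out.println("customers over 40: " + n);
  }
}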
