Re: Processing xml files

Mohit Anchlia Tue, 24 May 2011 17:21:21 -0700

Thanks, some more questions :)

On Tue, May 24, 2011 at 4:54 PM, Aleksandr Elbakyan <[email protected]> wrote:
> Can you please give more info?
>>> We currently have an off-Hadoop process which uses a Java XML parser to convert
>>> it to a flat file. We have files from a couple of KB to 10s of GB.


Do you convert it into a flat file and write it to HDFS? Do you write
all the files to the same directory in DFS, or do you group them into
directories based on days, for example? Say 2011/01/01 contains 10
files: store the results of those 10 files somewhere, and then on
2011/02/02 store another, say, 20 files. Now analyze the 20 files and
use the results from the 10 files to do the aggregation. If so, how do
you do it? Or how should I do it, since processing those files again
would be overhead?

Please point me to examples so that you don't have to teach me Hadoop
or Pig processing :)
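
To make the incremental idea above concrete, here is a minimal sketch in
Python, not a definitive implementation. It assumes the per-day results are
stored as "key<TAB>count" lines (the format a streaming reducer typically
writes) and that the aggregates are additive counts. The names
parse_counts, merge_aggregates, and the keys above_40 / below_40 are
hypothetical and only for illustration.

```python
from collections import Counter


def parse_counts(lines):
    """Read 'key<TAB>count' lines into a Counter, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if value.strip().isdigit():
            counts[key] += int(value)
    return counts


def merge_aggregates(previous, new):
    """Combine stored counts with counts from a newly processed batch."""
    merged = Counter(previous)
    merged.update(new)  # Counter.update adds counts instead of replacing them
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical results saved after processing the first day's files.
    stored = parse_counts(["above_40\t7\n"])
    # Hypothetical results from processing only the newly arrived files.
    latest = parse_counts(["above_40\t12\n", "below_40\t8\n"])
    print(merge_aggregates(stored, latest))
```

This only works as-is for aggregates that can be added together (counts,
sums). For something like an average, the stored result would need to keep
the sum and the count separately so the old files do not have to be read
again.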

>
> Do you append the data from multiple XML files as lines into one file? Or
> some other way? If so, how big do you let the files get?
>
>
> We currently feed our process a folder with the converted files. We don't size
> it in any way; we let Hadoop handle it.

Didn't think about it. I was just thinking in terms of using big
files. So when using small files, Hadoop will automatically distribute
the files across the cluster, I am assuming based on some hashing.

>
> How do you create these files, assuming your XML is stored somewhere
> else, in a DB or filesystem? Do you read them one by one?
>
>
> What are your experiences using text files instead of XML?
> If you are using a streaming job, it is easier to build your logic if you have
> one file. You can actually try to parse the XML in your mapper and convert it for
> the reducer, but why don't you just write a small app which will convert it?
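
A small converter of the kind suggested above could look roughly like the
following Python sketch. It is only an illustration under stated
assumptions: the thread shows just the <column> elements, so the wrapping
<record> element with the xsi namespace declaration, the FIELDS column
order, and the function name record_to_line are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: each record is wrapped in a <record> element that declares
# the xsi namespace; the thread only shows the <column> children.
SAMPLE = """<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <column id="Name" security="sensitive" xsi:type="Text">
   <value>free a last</value>
  </column>
  <column id="age" security="no" xsi:type="Text">
   <value>40</value>
  </column>
</record>"""

FIELDS = ["Name", "age"]  # hypothetical column order for the flat file


def record_to_line(xml_text, fields=FIELDS):
    """Turn one XML record into a single tab-delimited line."""
    root = ET.fromstring(xml_text)
    values = {}
    for column in root.iter("column"):
        value = column.findtext("value", default="")
        # Collapse whitespace so embedded tabs or newlines cannot break
        # the one-record-per-line, tab-delimited format.
        values[column.get("id")] = " ".join(value.split())
    # Missing columns become empty fields so every line has the same width.
    return "\t".join(values.get(name, "") for name in fields)


if __name__ == "__main__":
    print(record_to_line(SAMPLE))
```

Writing one line per record means many small XML files can be appended
into one larger flat file, which is then usable as streaming input.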


>
> Is there a reason why XML files can't or shouldn't be used directly in Hadoop?
> Any performance implications?
> If you are using Pig, there is an XML loader:
> http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html
>

Which one is better: converting the files to flat files or using the XML
as is? How do I make that decision?


> If you have a well-defined schema, it is easier to work with big data :)
>
> Any readings suggested in this area?
> Try looking into Pig; it has lots of useful stuff, which will make your
> experience with Hadoop nicer.

I will download the Pig tutorial and see how that works. Are there any
other XML-related examples you can point me to?

Thanks a lot!
>
> Our XML is something like:
>
>   <column id="Name" security="sensitive" xsi:type="Text">
>    <value>free a last</value>
>   </column>
>   <column id="age" security="no" xsi:type="Text">
>    <value>40</value>
>   </column>
>
> And we would, for example, want to know how many customers are above a certain age,
> or of a certain age with a certain income, etc.
>
> Hadoop has built-in counters; did you look into the word count example from Hadoop?
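
In the spirit of the word count example mentioned above, counting customers
above a certain age could be sketched as below. This assumes the XML has
already been converted to "name<TAB>age" lines; AGE_THRESHOLD and the key
names "above" / "not_above" are hypothetical. It is a local illustration of
the map and reduce steps, not a complete streaming job.

```python
from itertools import groupby

AGE_THRESHOLD = 30  # hypothetical cut-off


def mapper(lines, threshold=AGE_THRESHOLD):
    """Emit (key, 1) per record, the way word count emits (word, 1)."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[1].strip().isdigit():
            continue  # skip malformed records
        key = "above" if int(parts[1]) > threshold else "not_above"
        yield key, 1


def reducer(pairs):
    """Sum counts per key; expects pairs sorted by key, as Hadoop delivers them."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


if __name__ == "__main__":
    rows = ["free a last\t40\n", "someone else\t25\n", "third person\t52\n"]
    # sorted() stands in for the shuffle/sort phase between map and reduce.
    for key, total in reducer(sorted(mapper(rows))):
        print(f"{key}\t{total}")
```

In an actual streaming job the mapper and reducer would be two separate
scripts reading from standard input and writing "key<TAB>value" lines to
standard output; the functions here just keep the logic testable locally.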
>
>
> Regards,
> Aleksandr
>
> --- On Tue, 5/24/11, Mohit Anchlia <[email protected]> wrote:
>
> From: Mohit Anchlia <[email protected]>
> Subject: Re: Processing xml files
> To: [email protected]
> Date: Tuesday, May 24, 2011, 4:41 PM
>
> On Tue, May 24, 2011 at 4:25 PM, Aleksandr Elbakyan <[email protected]> 
> wrote:
>> Hello,
>>
>>  We have the same type of data. We currently convert it to a tab-delimited
>> file and use it as input for streaming.
>>
>
> Can you please give more info?
> Do you append the data from multiple XML files as lines into one file? Or
> some other way? If so, how big do you let the files get?
> How do you create these files, assuming your XML is stored somewhere
> else, in a DB or filesystem? Do you read them one by one?
> What are your experiences using text files instead of XML?
> Is there a reason why XML files can't or shouldn't be used directly in Hadoop?
> Any performance implications?
> Any readings suggested in this area?
>
> Our XML is something like:
>
>   <column id="Name" security="sensitive" xsi:type="Text">
>    <value>free a last</value>
>   </column>
>   <column id="age" security="no" xsi:type="Text">
>    <value>40</value>
>   </column>
>
> And we would, for example, want to know how many customers are above a certain age,
> or of a certain age with a certain income, etc.
>
> Sorry for all the questions. I am new and trying to get a grasp and
> also learn how I would actually solve our use case.
>
>> Regards,
>> Aleksandr
>>
>> --- On Tue, 5/24/11, Mohit Anchlia <[email protected]> wrote:
>>
>> From: Mohit Anchlia <[email protected]>
>> Subject: Processing xml files
>> To: [email protected]
>> Date: Tuesday, May 24, 2011, 4:16 PM
>>
>> I just started learning Hadoop and got done with the wordcount MapReduce
>> example. I also briefly looked at Hadoop streaming.
>>
>> Some questions:
>> 1) What should be my first step now? Are there more examples
>> somewhere that I can try out?
>> 2) The second question is around practical usability with XML files. Our
>> XML files are not big, they are around 120k in size, but Hadoop is
>> really meant for big files, so how do I go about processing these XML
>> files?
>> 3) Are there any samples or advice on how to process XML files?
>>
>>
>> Looking for help and pointers.
>>
>
