On 9/28/06, Bryan A. P. Pendleton <[EMAIL PROTECTED]> wrote:
It can be done! :) I'll see if I can contribute the code that I use for
these sorts of things... I've been parsing through nearly a terabyte of XML
on a regular basis using mapreduce since early this year.
One way to do it would be to define a variant of TextInputFormat that,
instead of using end-of-line to delimit elements, uses a custom string,
perhaps a regexp. Just define a regexp for the XML element that you can
cleanly treat independently, and away you go. Might not split work as
ideally across block boundaries as you'd like, but it should be possible to
make it perform much better than single-machine single-parser work would.
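
A rough sketch of the regexp idea, written in plain Python rather than
against Hadoop's InputFormat API, so none of the names below
(split_records, book_to_csv, BOOK_RE, FIELD_RE) exist in Hadoop; they are
made up for illustration. It assumes the <book> element from the example
quoted further down is the unit that can be treated independently, and that
field values contain no nested markup or XML entities.

```python
import csv
import io
import re

# Hypothetical record delimiter: one complete <book>...</book> element.
BOOK_RE = re.compile(r"<book>.*?</book>", re.DOTALL)
# Hypothetical field pattern for the three child elements in the example.
FIELD_RE = re.compile(r"<(title|author|ISBN)>(.*?)</\1>", re.DOTALL)


def split_records(text):
    """Return every complete <book> element found in text, in order."""
    return [m.group(0) for m in BOOK_RE.finditer(text)]


def book_to_csv(record):
    """Turn one <book> element into a single CSV line (the 'map' step)."""
    fields = {name: value.strip() for name, value in FIELD_RE.findall(record)}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="")
    writer.writerow([fields.get("title", ""),
                     fields.get("author", ""),
                     fields.get("ISBN", "")])
    return out.getvalue()


SAMPLE = """
<book>
<title>hadoop</title>
<author>peter</author>
<ISBN>121332</ISBN>
</book>
"""

if __name__ == "__main__":
    for record in split_records(SAMPLE):
        print(book_to_csv(record))
```

Each record is handled on its own, which is what lets the work be spread
across machines; a real input format would also have to deal with records
that straddle a split boundary, which this sketch does not attempt.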
Of course, worst cases kill you. It might be a problem if one of your
segments is a couple of gigabytes long, for instance, as is the case with my
dataset if you use the highest-level XML container. The current mapreduce
code really requires that your Writable instances fit in memory.
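
To make the memory concern concrete, here is a minimal sketch (again plain
Python, not Hadoop code) of a reader that pulls delimited records out of a
stream chunk by chunk and gives up once a single record grows past a limit.
The function name iter_records, the chunk size and the limit are all
assumptions chosen for illustration.

```python
import io


def iter_records(stream, end_tag, chunk_size=64 * 1024,
                 max_record_chars=1024 * 1024):
    """Yield records terminated by end_tag, reading stream in chunks.

    Raises ValueError as soon as one record (complete or still being
    buffered) is longer than max_record_chars.
    """
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of input; trailing data without end_tag is dropped
        buf += chunk
        # Emit every complete record currently sitting in the buffer.
        while True:
            idx = buf.find(end_tag)
            if idx < 0:
                break
            end = idx + len(end_tag)
            record = buf[:end].strip()
            if len(record) > max_record_chars:
                raise ValueError("record exceeds max_record_chars")
            yield record
            buf = buf[end:]
        # Whatever is left is a partial record; refuse to buffer it forever.
        if len(buf) > max_record_chars:
            raise ValueError("record exceeds max_record_chars")


if __name__ == "__main__":
    data = io.StringIO("<book><title>a</title></book>\n"
                       "<book><title>b</title></book>\n")
    for rec in iter_records(data, "</book>", chunk_size=7):
        print(rec)
```

Picking a lower-level element as the delimiter keeps each record small; the
limit only turns an out-of-memory failure into an explicit error.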
On 9/27/06, Vetle Roeim <[EMAIL PROTECTED]> wrote:
>
> On Wed, 27 Sep 2006 06:37:06 +0200, Feng Jiang <[EMAIL PROTECTED]>
> wrote:
>
> > One principle is that the input file must be a sequence of key/value
> > pairs, and you must have an input format for the input file. Otherwise
> > you cannot use mapreduce directly.
>
> Technically, yes, but that pair could just as well consist of XML and some
> insignificant value. The most obvious example is processing log files,
> where data might be read using TextInputFormat, and the line number simply
> discarded.
>
>
> [...]
> >> In my example, XML parsing to CSV seems to be a one-to-one mapping, e.g.
> >>
> >> <book>
> >> <title>hadoop</title>
> >> <author>peter</author>
> >> <ISBN>121332</ISBN>
> >> </book>
>
> In this case, you might have to write your own implementation of
> InputFormat that reads the entire XML fragment into some kind of data
> structure.
>
>
> >> would become (CSV)
> >> hadoop,peter,121332
> >>
> >> Using mapreduce seems not suitable for this?
> >>
> >> thanks.
> >>
>
>
>
> --
> Vetle Roeim
> Team Manager, Information Systems
> Opera Software ASA <URL: http://www.opera.com/ >
>
--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
Hello,
This sounds good. I would appreciate it if you could share your code (just
the outline is okay), because I am new to mapreduce. Thanks.
Regards,
howa