Hello all, I have been searching through the Hadoop mail archives, but I could not find a relevant answer to my question. Before starting, let me say that I am a newbie, just beginning to explore the areas that Hadoop covers.
The problem: We will be called upon to implement a system that should be able to index and access possibly terabytes of an ever-growing file base. The files are write-once, in structured XML format, zipped together into so-called "dossiers" or maybe existing as free-floating files (not sure about that one yet). The bad thing is that no one knows the complete set of indexing requirements yet, and probably never will. The most likely scenario is that, after the system's deployment, the client's IT staff will be called upon to create ad hoc queries over this sea of data, re-index it, generate statistics, perform data mining, etc.

At least from a very high-level perspective, all this reminds me very much of the situations that MapReduce/GFS were invented to deal with. The problem is that the data is structured, not almost-flat text files as in the case of a web-crawling engine. We need to take the XML structure of the documents into account for the indexing to produce any meaningful result.

The question is: can Hadoop be a help or a burden in developing such a system? I am particularly concerned about the fact that the Hadoop FS stores data in huge blocks and the scheduler "cuts" them at arbitrary byte offsets prior to MapReduce. This way, many XML files will co-exist in one block, and some will certainly be cut in half. Do the Hadoop API and architecture in general give the developer of the MapReduce functions a chance to reliably reconstruct the original files that compose each block, for processing other than grep-like scanning (in my case, SAX-driven parsing)? Any ideas on how this might be achieved, or where I should start digging in the Javadocs? I am not asking whether Hadoop can do this off the shelf, but whether it can be achieved with reasonable development effort.

Perhaps the answers to my concerns/questions are right in front of my eyes in the Javadocs/Wiki, or perhaps in some modules/examples/code included in project Nutch. Perhaps I have not even understood the MapReduce paradigm correctly. Feel free to send any idea, comment, or link you think might help; I would welcome and appreciate any hint in the right direction!

Thank you for your time.

Cheers,
S.

Stelios Gerogiannakis
--
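
P.S. To make the reconstruction question a bit more concrete, below is the kind of hook I imagine would be needed, judging from the InputFormat/RecordReader classes I have seen in the Javadocs: an input format that refuses to split files, so that each map task receives one whole XML document and can hand it to a SAX parser. Please treat the class and method names as my guess at how the pieces fit together, not as tested code; corrections are very welcome.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of an input format that never splits a file, so the framework cannot
// cut an XML document at an arbitrary byte offset.
public class WholeXmlFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one file becomes one split, processed by one map task
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire (unsplit) file as a single key/value record.
    public static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Pull the whole file from the Hadoop FS into one value.
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}

If something along these lines is possible, the map function could feed the bytes of each file straight into a SAX parser; for the zipped dossiers I suppose the reader would also have to unpack the archive entries. Is this roughly the intended extension point, or am I fighting the framework?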
