Hello all, I have been searching through the Hadoop mail archives, but I could not find a relevant answer to my question. Before starting, let me say that I am a newbie, just beginning to explore the areas that Hadoop covers.
The problem: We will be called upon to implement a system that should be able to index and access possibly terabytes of an ever-growing file base. The files are write-once, in structured XML format, zipped together into so-called "dossiers" or maybe existing as free-floating files (not sure about that one yet). The bad thing is that no one knows the complete set of indexing requirements yet, and probably never will. The most likely scenario is that, after the system's deployment, the client's IT staff will be called upon to create ad hoc queries over this sea of data, re-index it, generate statistics, perform data mining, etc.

At least from a very high-level perspective, all this reminds me very much of the situations that MapReduce/GFS were invented to deal with. The problem is that the data is structured, not almost-flat text files as in the case of a web-crawling engine. We need to take the XML structure of the documents into account for the indexing to produce any meaningful result.

The question is: can Hadoop be a help or a burden in developing such a system? I am particularly concerned about the fact that the Hadoop FS stores data in huge blocks and the scheduler "cuts" them at arbitrary byte offsets prior to MapReduce. This way, many XML files will co-exist in one block, and some will certainly be cut in half. Do the Hadoop API and architecture in general give the developer of the MapReduce functions a chance to reliably reconstruct the original files that compose each block, for processing other than grep-like scanning (in my case, SAX-driven parsing)? Any ideas on how this might be achieved, or where I should start digging in the Javadocs? I am not asking whether Hadoop can do this off the shelf, but whether it can be achieved with reasonable development effort.

Perhaps the answers to my concerns/questions are right in front of my eyes in the Javadocs/Wiki, or perhaps in some modules/examples/code included in project Nutch. Perhaps I have not even understood the MapReduce paradigm correctly. Feel free to send any idea, comment, or link you think might help; I would welcome and appreciate any hint in the right direction!

Thank you for your time.

Cheers,
S.

Stelios Gerogiannakis
--
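
P.S. To make the reconstruction question a bit more concrete, below is the kind of hook I imagine would be needed, judging from the InputFormat/RecordReader classes I have seen in the Javadocs: an input format that refuses to split files, so that each map task receives one whole XML document and can hand it to a SAX parser. Please treat the class and method names as my guess at how the pieces fit together, not as tested code; corrections are very welcome.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of an input format that never splits a file, so the framework cannot
// cut an XML document at an arbitrary byte offset.
public class WholeXmlFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one file becomes one split, processed by one map task
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire (unsplit) file as a single key/value record.
    public static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Pull the whole file from the Hadoop FS into one value.
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}

If something along these lines is possible, the map function could feed the bytes of each file straight into a SAX parser; for the zipped dossiers I suppose the reader would also have to unpack the archive entries. Is this roughly the intended extension point, or am I fighting the framework?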
