Hello everyone, I am building a Hadoop "app" to quickly index a corpus of documents. The app will accept one or more XML files containing the corpus. Each document is made up of several sections: title, authors, body, and so on. These sections are not fixed and depend on the collection. Here is a sample of what an XML input file looks like:
<document id='1'>
  <field name='title'>the divine comedy</field>
  <field name='author'>Dante</field>
  <field name='body'>halfway along our life's path.......</field>
</document>
<document id='2'>
  ...
</document>

I would like to discuss some implementation choices:

- What is the best way to "tell" my Hadoop app which sections to expect between the <document> and </document> tags?
- Is it more appropriate to implement a RecordReader that passes the whole content of the document tag to the mapper, or one that passes it section by section? I am also wondering which parser to use, a DOM-style one or a SAX-style one. Can you recommend any efficient library?
- Do you know any library I could use to process text? By text processing I mean common preprocessing operations such as tokenization and stopword elimination. I was thinking of using Lucene's analysis engine; could it become a bottleneck?

I am looking forward to reading your opinions.

Thanks,
Marco
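P.S. To make the parser question concrete, here is a rough SAX-style sketch of the section-by-section approach I have in mind, using only the JDK's built-in SAX support. The class name and the helper `parse` method are just placeholders; a real RecordReader would stream records to the mapper instead of collecting them in a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: a SAX handler that collects one (field name, text) pair per
// <field> section of a document. A RecordReader could hand these pairs
// to the mapper one at a time instead of passing the whole document.
public class FieldHandler extends DefaultHandler {
    private final List<String[]> sections = new ArrayList<>();
    private String currentField;
    private StringBuilder text;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        if ("field".equals(qName)) {
            currentField = attrs.getValue("name");
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only accumulate characters while inside a <field> element.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("field".equals(qName)) {
            sections.add(new String[] { currentField, text.toString().trim() });
            text = null;
        }
    }

    public List<String[]> getSections() {
        return sections;
    }

    // Convenience entry point for a single document's XML as a string.
    public static List<String[]> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        FieldHandler handler = new FieldHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getSections();
    }
}
```

My impression is that SAX (or a pull parser like StAX) fits this case better than DOM, since it streams and never builds the whole tree in memory, which matters with large splits.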
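P.P.S. For the text-processing question, this is the kind of pipeline I mean, sketched in plain Java (lowercase, tokenize, drop stopwords; the tiny stopword set is just for illustration). Lucene's analyzers implement the same tokenizer-plus-filters pipeline, which is why I was considering them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch of the preprocessing I mean: lowercase the text, split it into
// tokens on non-letter characters, and filter out stopwords.
public class Preprocess {
    // Toy stopword list, for illustration only.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "our"));

    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!tok.isEmpty() && !STOPWORDS.contains(tok)) {
                out.add(tok);
            }
        }
        return out;
    }
}
```

Since analysis like this is CPU-bound and runs independently per record, my guess is it parallelizes cleanly across mappers rather than becoming a bottleneck, but I would be glad to hear from anyone who has measured it.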
