Look at the Reuters example in the Mahout project: http://mahout.apache.org
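On the DOM-vs-SAX question in the quoted mail below: for large corpus files a streaming SAX parser avoids holding the whole file in memory. Here is a minimal, self-contained sketch (JDK only, no Hadoop or Lucene) of a SAX handler that collects each `<document>` into a map keyed by the `name` attribute of its `<field>` children. The class name `DocumentSaxDemo` and the `parse` helper are illustrative, not from any existing project; note also that the demo wraps the documents in a `<docs>` root element, since well-formed XML needs a single root.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DocumentSaxDemo {

    /** Parses <document id=...><field name=...>text</field>...</document>
     *  records into one map per document ("id" plus one entry per field). */
    public static List<Map<String, String>> parse(String xml) throws Exception {
        List<Map<String, String>> docs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            Map<String, String> current;            // document being read
            String fieldName;                        // name attr of open <field>
            final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (qName.equals("document")) {
                    current = new HashMap<>();
                    current.put("id", atts.getValue("id"));
                } else if (qName.equals("field")) {
                    fieldName = atts.getValue("name");
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                // Only accumulate text while inside a <field> element.
                if (fieldName != null) text.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("field")) {
                    current.put(fieldName, text.toString().trim());
                    fieldName = null;
                } else if (qName.equals("document")) {
                    docs.add(current);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        return docs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs>"
            + "<document id='1'>"
            + "<field name='title'> the divine comedy </field>"
            + "<field name='author'>Dante</field>"
            + "</document>"
            + "</docs>";
        for (Map<String, String> doc : parse(xml)) {
            System.out.println(doc.get("id") + " -> " + doc.get("title"));
        }
    }
}
```

Because the field names vary per collection, the map-per-document shape means the mapper does not need to know the section names in advance; inside a real Hadoop record reader the same handler logic would emit one such map per record.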
On Fri, Jan 28, 2011 at 2:49 AM, Marco Didonna <[email protected]> wrote:
> Hello everyone,
> I am building a Hadoop "app" to quickly index a corpus of documents.
> The app will accept one or more XML files containing the corpus.
> Each document is made up of several sections: title, authors,
> body... These sections are not static and depend on the collection.
> Here is a sample of how the XML input file looks:
>
> <document id='1'>
> <field name='title'> the divine comedy </field>
> <field name='author'>Dante</field>
> <field name='body'>halfway along our life's path.......</field>
> </document>
> <document id='2'>
>
> ...
>
> </document>
>
> I would like to discuss some implementation choices:
>
> - What is the best way to tell my Hadoop app which sections to expect
> between the <document> and </document> tags?
>
> - Is it more appropriate to implement a record reader that passes the
> whole content of the document tag to the mapper, or one section at a
> time? I was also wondering which parser to use, a DOM-like one or a
> SAX-like one... Can you recommend any (efficient) library?
>
> - Do you know of any library I could use to process text? By text
> processing I mean common preprocessing operations like tokenization,
> stopword elimination... I was thinking of using Lucene's engine. Could
> it become a bottleneck?
>
> I am looking forward to reading your opinions.
>
> Thanks,
>
> Marco

--
Lance Norskog
[email protected]
