Look at the Reuters example in the Mahout project: http://mahout.apache.org
Some quick sketches inline on your three questions below.

On Fri, Jan 28, 2011 at 2:49 AM, Marco Didonna <[email protected]> wrote:
> Hello everyone,
> I am building a Hadoop "app" to quickly index a corpus of documents.
> The app will accept one or more XML files that contain the corpus.
> Each document is made up of several sections: title, authors,
> body... These sections are not fixed and depend on the collection.
> Here's a glimpse of what the XML input file looks like:
>
> <document id='1'>
> <field name='title'> the divine comedy </field>
> <field name='author'>Dante</field>
> <field name='body'>halfway along our life's path.......</field>
> </document>
> <document id='2'>
>
> ...
>
> </document>
>
> I would like to discuss some implementation choices:
>
> - what is the best way to "tell" my Hadoop app which sections to expect
> between <document> and </document> tags?
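
One simple way (a sketch, not something taken from the Mahout example):
pass the field names through the job Configuration as a comma-separated
list and read them back in the mapper's setup(). The "corpus.fields" key
below is made up for illustration, not a standard Hadoop property.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("corpus.fields", "title,author,body"); // made-up key
    Job job = new Job(conf, "corpus-indexer");
    job.setJarByClass(IndexDriver.class);
    // ... input/output paths, mapper class, etc.
  }

  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Set<String> expectedFields;

    // Recover the field list once per task, not once per record.
    @Override
    protected void setup(Context context) {
      String[] names = context.getConfiguration().get("corpus.fields", "").split(",");
      expectedFields = new HashSet<String>(Arrays.asList(names));
    }
  }
}
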
>
> - is it more appropriate to implement a record reader that passes the
> whole content of the document tag to the mapper, or one section at a
> time? I am also wondering which parser to use, a DOM-like one or a
> SAX-like one... any efficient library to recommend?
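
I would hand the mapper the whole <document> blob and parse it there. If
memory serves, Mahout ships an XmlInputFormat (its Wikipedia example uses
it) that splits input on a configurable start/end tag pair, with keys
named something like xmlinput.start and xmlinput.end, so each map call
sees one complete document. For the per-document parsing, a streaming
pull parser (StAX, javax.xml.stream in the JDK) avoids building a DOM per
document and is easier to drive than SAX callbacks. A sketch, assuming
the <field name='...'> layout from your sample:

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DocumentParser {
  // Turns one <document>...</document> blob into a field-name -> text map.
  public static Map<String, String> parse(String xml) throws XMLStreamException {
    Map<String, String> fields = new HashMap<String, String>();
    XMLStreamReader r =
        XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    String currentField = null;
    StringBuilder text = new StringBuilder();
    while (r.hasNext()) {
      switch (r.next()) {
        case XMLStreamConstants.START_ELEMENT:
          if ("field".equals(r.getLocalName())) {
            currentField = r.getAttributeValue(null, "name");
            text.setLength(0);
          }
          break;
        case XMLStreamConstants.CHARACTERS:
          if (currentField != null) text.append(r.getText());
          break;
        case XMLStreamConstants.END_ELEMENT:
          if ("field".equals(r.getLocalName()) && currentField != null) {
            fields.put(currentField, text.toString().trim());
            currentField = null;
          }
          break;
      }
    }
    r.close();
    return fields;
  }
}
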
>
> - do you know of any library I could use to process text? By text
> processing I mean common preprocessing operations like tokenization
> and stopword elimination... I was thinking of using Lucene's
> engine... could it become a bottleneck?
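
Lucene's analyzers work fine outside an index and are unlikely to be your
bottleneck next to HDFS and shuffle I/O; just build the Analyzer once per
mapper rather than once per record. A sketch against the Lucene 3.0 API
(StandardAnalyzer tokenizes, lowercases, and drops English stopwords by
default):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TextPreprocessor {
  private final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

  // Tokenize, lowercase, and strip English stopwords from one section's text.
  public List<String> analyze(String fieldName, String text) throws IOException {
    TokenStream ts = analyzer.tokenStream(fieldName, new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    List<String> tokens = new ArrayList<String>();
    ts.reset();
    while (ts.incrementToken()) {
      tokens.add(term.term());
    }
    ts.end();
    ts.close();
    return tokens;
  }
}
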
>
> I am looking forward to reading your opinions.
>
> Thanks,
>
> Marco



-- 
Lance Norskog
[email protected]
