> > Do you think I can get any advantage from building a solution on Lucene?
Lucene is generally about information retrieval, not information extraction (as suggested, GATE or UIMA are more commonly used for extraction). However, Lucene can play a role in extraction if you use it to estimate probabilities rather than relying purely on rule-based techniques such as regexes.

A Lucene index provides fast look-ups of term frequencies and can therefore help estimate the likelihood that a word is being used in a particular context, given large volumes of training data. You'll need to get creative about which sources of existing pre-tagged data might be useful for training, and write a bunch of custom code, but in my experience Lucene can be useful for extraction when used this way.

Cheers,
Mark

----- Original Message ----
From: Julien Nioche <lists.digitalpeb...@gmail.com>
To: java-user@lucene.apache.org
Sent: Thu, 14 January, 2010 12:41:01
Subject: Re: Extracting contact data

Hi,

Tools like GATE (http://www.gate.ac.uk) or Apache UIMA would be good candidates for what you are trying to achieve.

HTH

--
DigitalPebble Ltd
http://www.digitalpebble.com

2010/1/14 Ortelli, Gian Luca <gianluca.orte...@truvo.com>

> Well, the exact definition we're going to find out empirically, as we run
> an implementation through our data and look at the quality of results...
> For now, I would use the number of tokens between the finding
> ("a...@def.com") and the word that gives context ("Contact").
>
> Anyway, replying to karl: I'm not searching for a given
> email/street/time interval/etc., I need to extract EVERY
> email/street/time interval/etc. from the text. The kind of need for
> which you suggest a natural language processing tool.
>
> Gianluca
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, January 13, 2010 6:06 PM
> To: java-user@lucene.apache.org
> Subject: Re: Extracting contact data
>
> Before answering, how do you measure "proximity"?
> You can make Lucene work with locations (there's an example in Lucene In
> Action) readily enough though....
>
> HTH
> Erick
>
> On Wed, Jan 13, 2010 at 11:39 AM, Ortelli, Gian Luca <
> gianluca.orte...@truvo.com> wrote:
>
> > Hi community,
> >
> > I have a general understanding of Lucene concepts, and I'm wondering if
> > it's the right tool for my job:
> >
> > - I need to extract data like e.g. time intervals ("8am - 12pm"), street
> > addresses from a set of files. The common issue with these data units is
> > that they contain spaces and are not always definable through regexes.
> >
> > - the extraction must take into consideration the "proximity": for
> > example, a mail address which is close to the word "Contacts" will
> > receive a higher rank, since I'm looking for contact data.
> >
> > Do you think I can get any advantage from building a solution on Lucene?
> >
> > Gianluca
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
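
For readers of this thread, here is a rough Java sketch of the two ideas discussed above: estimating from pre-tagged training data how likely a token is to appear in a given context (the term-frequency approach from the reply), and scoring a candidate by its token distance to a context word such as "Contacts" (the proximity measure from the quoted message). Everything in it is an assumption made for illustration: the class and method names are hypothetical, plain in-memory maps stand in for look-ups against a Lucene index, and the add-one smoothing and the 1 / (1 + distance) score are arbitrary choices, not something proposed in the thread.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; plain maps stand in for term-frequency look-ups in a Lucene index. */
final class ContactExtractionSketch {

    // token -> number of times it was seen inside a span tagged with the target label
    private final Map<String, Integer> taggedCounts = new HashMap<>();
    // token -> number of times it was seen anywhere in the training data
    private final Map<String, Integer> totalCounts = new HashMap<>();

    /** Record one training token and whether it occurred inside a tagged span. */
    void observe(String token, boolean insideTaggedSpan) {
        String key = token.toLowerCase();
        totalCounts.merge(key, 1, Integer::sum);
        if (insideTaggedSpan) {
            taggedCounts.merge(key, 1, Integer::sum);
        }
    }

    /** Estimate P(tagged | token) with add-one smoothing (assumed); unseen tokens get 0.5. */
    double likelihood(String token) {
        String key = token.toLowerCase();
        int tagged = taggedCounts.getOrDefault(key, 0);
        int total = totalCounts.getOrDefault(key, 0);
        return (tagged + 1.0) / (total + 2.0);
    }

    /** Smallest token distance from the candidate to contextWord, or -1 if the word is absent. */
    static int tokenDistance(List<String> tokens, int candidateIndex, String contextWord) {
        int best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equalsIgnoreCase(contextWord)) {
                int distance = Math.abs(i - candidateIndex);
                if (best < 0 || distance < best) {
                    best = distance;
                }
            }
        }
        return best;
    }

    /** Map a distance to a rank score: closer scores higher; 0 when the context word is absent. */
    static double proximityScore(int distance) {
        return distance < 0 ? 0.0 : 1.0 / (1.0 + distance);
    }
}
```

In use, one would call observe() for every token of the pre-tagged corpus, then combine likelihood() for a candidate's tokens with proximityScore(tokenDistance(...)) to rank candidates. How to weight the two scores against each other is left open here and would have to be tuned empirically against real data.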