You might also look at some other NLP tools, such as OpenNLP, which
you can train for your collection, or, if you are interested in
buying, there are many products on the market that do similar things.
On Sep 26, 2006, at 9:36 AM, Otis Gospodnetic wrote:
Look at LingPipe from Alias-i.com. Look at Named Entity extraction
and its classifiers.
Otis
----- Original Message ----
From: Vladimir Olenin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, September 25, 2006 9:49:31 PM
Subject: does anyone know of a 'smart' categorizing text pattern
finder?
Hi,
I wonder if anyone here knows if there is a 'smart' text pattern
finder, ideally written in Java. The library I'm looking for should
be able to 'guess' the category of the particular text on the page,
most probably by finding similarities between the bulk of the pages
and a set of templates.
E.g., many forums are powered by phpBB, which structures 99% of the
pages (except for some title pages & user profile pages) in a very
similar fashion (page is broken into blocks, each block is broken
into further blocks, etc). By comparing many pages with each other
(e.g., from the same domain root: forum.springframework.org) it
should be possible to detect common ('template decorations') and
page-specific (actual content, like 'user name' and 'posting body')
parts. After that it should further be possible, by comparing the
'template decorations' parts to a set of templates, to 'guess' the
nature of each of the 'page-specific' blocks (e.g., 'Vladimir Olenin'
in the left side column will be marked as 'name', while whatever is
adjacent to this column is the post body).
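To make the first step concrete, here is a rough sketch of the page
comparison, assuming a 'block' is simply a trimmed line of the page
and that anything occurring on most pages counts as a 'template
decoration'. TemplateDiff and its methods are made-up names for
illustration, not part of Lucene, Nutch, LingPipe or OpenNLP, and a
real version would split on HTML elements rather than on lines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: separates shared template blocks from page-specific ones. */
class TemplateDiff {

    /** Splits a page into trimmed, non-empty lines (the assumed notion of a 'block'). */
    static List<String> blocks(String page) {
        List<String> out = new ArrayList<>();
        for (String line : page.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    /** Returns the blocks that occur on at least minShare (0..1) of the pages. */
    static Set<String> templateBlocks(List<String> pages, double minShare) {
        Map<String, Integer> pageCount = new HashMap<>();
        for (String page : pages) {
            // Count each block once per page, however often it repeats there.
            for (String block : new HashSet<>(blocks(page))) {
                pageCount.merge(block, 1, Integer::sum);
            }
        }
        Set<String> template = new HashSet<>();
        for (Map.Entry<String, Integer> e : pageCount.entrySet()) {
            if (e.getValue() >= minShare * pages.size()) {
                template.add(e.getKey());
            }
        }
        return template;
    }

    /** Returns the blocks of one page that are not template blocks, in page order. */
    static List<String> pageSpecific(String page, Set<String> template) {
        List<String> out = new ArrayList<>();
        for (String block : blocks(page)) {
            if (!template.contains(block)) {
                out.add(block);
            }
        }
        return out;
    }
}
```

Calling templateBlocks on a batch of pages from one forum and then
pageSpecific on each page leaves only the candidate content blocks.
Deciding which of those is the 'name' and which the 'post body' is
the second, harder step and is not covered by this sketch.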
So, I wonder if anyone knows of a package capable of such things.
The primary goal, though, is simpler: to be able to parse out just
posters' names from message boards. Though sometimes the 'block
category' can be derived from the CSS class name of the tags around
the text, that's very often not the case.
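For the easy case where the CSS class does give the category away,
something like the sketch below would do. The class name
'postauthor' and the PosterNames helper are assumptions made for
illustration (skins differ), and a regex is only good enough for
flat markup with no nested tags inside the element; a real HTML
parser would be safer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the text of elements carrying a given CSS class. */
class PosterNames {

    /** Returns the trimmed text of every element whose class attribute lists cssClass. */
    static List<String> byCssClass(String html, String cssClass) {
        // Matches <tag ... class="... cssClass ...">text</tag> where text has no nested tags.
        Pattern p = Pattern.compile(
            "<(\\w+)[^>]*\\bclass=\"[^\"]*\\b" + Pattern.quote(cssClass)
                + "\\b[^\"]*\"[^>]*>([^<]*)</\\1>");
        List<String> names = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            names.add(m.group(2).trim());
        }
        return names;
    }
}
```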
Might Nutch have similar functionality built into their crawler?
Thanks.
Vlad
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886