Iain,

Ah thanks for that. I am actually playing with it right now.
Are you using it?

----- Original Message -----
From: "Iain" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Sunday, September 24, 2006 6:26 PM
Subject: RE: crawl/index/search


You might want to check out GATE from Sheffield University. It's like UIMA
in concept, but more mature and probably richer.

They've got a number of modules which integrate with Lucene, so integration
with Nutch should be easier.


Iain

---------------
Iain Downs (Microsoft MVP)
Commercial Software Therapist
E:  [EMAIL PROTECTED]     T:+44 (0) 1423 872988
W: www.idcl.co.uk
http://mvp.support.microsoft.com
-----Original Message-----
From: Fadzi Ushewokunze [mailto:[EMAIL PROTECTED]
Sent: 24 September 2006 04:03
To: [email protected]
Subject: Re: crawl/index/search

Richard,

Thanks for the insight.

I have spent the past few days looking into lightweight structured text,
text mining and eventually Natural Language Processing. Through
further research I came across UIMA from IBM, and I liked
the idea behind it. I played around with it, but it is a huge monster!

It's still new to me, so I am still getting my head around it, but I think
it has the potential to achieve a lot. Have you ever dealt with it?

Or for that matter, if anyone in the community has, it would be nice to
get some info on this, especially if you have integrated it with
Nutch/Lucene.

It seems UIMA will be entering the Apache Incubator, and it already has a
decent-sized community behind it.

Anyway, this is a whole new world for me at the moment (NLP, structured
text, etc.), so I am still evaluating my requirements and what tools are
available.


----- Original Message -----
From: "Richard Braman" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, September 20, 2006 12:55 PM
Subject: Re: crawl/index/search


Getting other information out of the page requires parsing. In this case
you have to come up with some pretty complicated regular expressions,
unless the information that you want, like the company name, is going to
be in the same place on each site.

I don't know how to tackle this problem with anything that comes
stock with Nutch, but writing a plugin would be the way to go,
especially if it is in the public domain.

I have thought about developing a similar plugin, but the question
becomes: what do you use?  I view regular expressions as having many
shortcomings.  For instance, they usually only apply as a custom
solution for locating a particular piece of information in a particular
structure.  I would like a more robust framework for matching patterns
that is easy to use, can be extended, and so forth.  Regular
expressions won't cut it in many cases and don't allow normal users to
write their own.  For example, what is a regular expression for a
company name?  An email address would be an easy one to make a regexp for,
which is why so many spammers use web crawlers to harvest email
addresses from the web.
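
To illustrate why an email address is the "easy" case, here is a minimal
sketch in plain Java. The class name EmailExtractor and the pattern are
assumptions made up for this example; the pattern is deliberately
simplified and does not cover the full address grammar, and nothing here
is Nutch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Lucene.
class EmailExtractor {
    // Simplified pattern (an assumption for illustration): a local part,
    // an "@", one or more dot-separated labels, and an alphabetic TLD.
    private static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,}");

    // Returns every substring of the text that matches the pattern, in order.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(
            extract("Contact sales@example.com or support@mail.example.org."));
    }
}
```

There is no comparably short pattern for a company name, which is the
shortcoming described above.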

Turns out, there is a whole field of Information Retrieval developing
technologies dedicated to parsing through text and using
advanced ontologies to determine anything and everything about the text
in a document.  They can determine whether a term is a noun, verb,
adjective, and so forth.
They can also determine whether something matches a pattern such as an
email address, address, or company name.   The problem is most of this
is not in the public domain.

I think most users use regexps to find what they are looking for, but I am
quite sure that using an advanced parsing library would yield a
more robust plugin.

In my research, I stumbled on LAPIS, a lightweight structure for text
processing that uses advanced technology.  LAPIS is open source,
developed by MIT, and is Java based.  I used it and was quite impressed
with its ease of use.  I think this would be a very interesting
framework to adapt to Nutch.  If anyone else knows any other open source
libraries for determining structure, please comment. You can read more
about LAPIS here, or google "lightweight structure text":
http://www.softwaresecretweapons.com/jspwiki/Wiki.jsp?page=Lapis

I would be willing to help you if you would be willing to put the plugin
into the public domain.

Here are the 0.7 docs for writing plugins:
http://wiki.apache.org/nutch/WritingPluginExample

[EMAIL PROTECTED] wrote:
hi there,

Been playing with Nutch for a few weeks now, so I am starting on
coming up with something usable, but I need some suggestions here.

Here's the problem: crawl the web (maybe 50 sites or so) and get
physical addresses.

I want to index physical addresses found on the crawl, so my search
results should return "Company Name, State" as the Title; the Summary
can be whatever is found on that page. [This is just an example to
simplify what I want to say.]

To index, looking at the Nutch code, it seems I have to parse the HTML
content and look for the details I need to be searchable. At the
moment only things found in the META data are indexed, but I want to expand
this with custom fields, such as company name, state, etc.

What's the best way to go about this? I want to write a plugin for
this. Which classes do I start with, and how do I tackle this?
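
As a rough sketch of the extraction step only, the plain-Java class below
pulls a US-style "City, ST 12345" fragment out of page text and returns it
as named fields. The class name AddressFieldExtractor, the field names and
the pattern are all assumptions made up for this example; wiring the
result into Nutch's parse and indexing extension points is not shown and
depends on the Nutch version.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Lucene.
class AddressFieldExtractor {
    // Assumed format: capitalised city words, a comma, a two-letter state
    // code, and a 5-digit ZIP with an optional 4-digit extension.
    private static final Pattern CITY_STATE_ZIP = Pattern.compile(
        "([A-Z][A-Za-z.]*(?: [A-Z][A-Za-z.]*)*), ([A-Z]{2}) (\\d{5}(?:-\\d{4})?)");

    // Returns field name -> value for the first match, or an empty map.
    static Map<String, String> extract(String pageText) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = CITY_STATE_ZIP.matcher(pageText);
        if (m.find()) {
            fields.put("city", m.group(1));
            fields.put("state", m.group(2));
            fields.put("zip", m.group(3));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(
            extract("Acme Corp, 12 Main Street, Springfield, IL 62704. Call us."));
    }
}
```

The pattern is crude: any capitalised word directly before the city name
is swallowed into the "city" field, and the company name is not found at
all.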

Thanks