Iain,

Thanks for the pointer to GATE. I will take a look at it too.

Richard
Fadzi Ushewokunze wrote:
> Iain,
>
> Ah, thanks for that. I am actually playing with it right now.
> Are you using it?
>
> ----- Original Message -----
> From: "Iain" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Sunday, September 24, 2006 6:26 PM
> Subject: RE: crawl/index/search
>
>> You might want to check out GATE from Sheffield University. It's like
>> UIMA in concept, but more mature and probably richer.
>>
>> They've got a number of modules which integrate with Lucene, so
>> integration with Nutch should be easier.
>>
>> Iain
>>
>> ---------------
>> Iain Downs (Microsoft MVP)
>> Commercial Software Therapist
>> E: [EMAIL PROTECTED]  T: +44 (0) 1423 872988
>> W: www.idcl.co.uk
>> http://mvp.support.microsoft.com
>>
>> -----Original Message-----
>> From: Fadzi Ushewokunze [mailto:[EMAIL PROTECTED]]
>> Sent: 24 September 2006 04:03
>> To: [email protected]
>> Subject: Re: crawl/index/search
>>
>> Richard,
>>
>> Thanks for the insight.
>>
>> I have spent the past few days looking at lightweight structured
>> text, text mining, and eventually Natural Language Processing.
>> Through further research I came across UIMA from IBM - I liked the
>> idea behind it. I played around with it, but it is a huge monster!
>>
>> It's still new to me, so I am still getting my head around it, but I
>> think it has the potential to achieve a lot. Have you ever dealt with
>> it?
>>
>> Or, for that matter, if anyone in the community has, it would be nice
>> to get some info on this, especially if you have integrated it with
>> Nutch/Lucene.
>>
>> It seems UIMA will be in the Apache Incubator - it also has a
>> decent-sized community behind it already.
>>
>> Anyway, this is a whole new world (NLP, structured text, etc.) for me
>> at the moment, so I am still evaluating my requirements and what
>> tools are available.
>>
>> ----- Original Message -----
>> From: "Richard Braman" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Wednesday, September 20, 2006 12:55 PM
>> Subject: Re: crawl/index/search
>>
>>> Getting other information out of the page requires parsing. In this
>>> case you have to come up with some pretty complicated regular
>>> expressions, unless the information you want, like the company name,
>>> is going to be in the same place on each site.
>>>
>>> I don't know how to tackle this problem with anything that comes
>>> stock with Nutch, but writing a plugin would be the way to go,
>>> especially if it is in the public domain.
>>>
>>> I have thought about developing a similar plugin, but the question
>>> becomes: what do you use? I view regular expressions as having many
>>> shortcomings. For instance, they usually only apply as a custom
>>> solution for locating a particular piece of information in a
>>> particular structure. I would like a more robust framework for
>>> matching patterns that is easy to use, can be extended, and so
>>> forth. Regular expressions won't cut it in many cases, and they
>>> don't allow normal users to write their own. For example, what is a
>>> regular expression for a company name? An email address would be an
>>> easy one to make a regexp for, which is why so many spammers use web
>>> crawlers to harvest email addresses from the web.
>>>
>>> It turns out there is a whole field of Information Retrieval
>>> developing technologies dedicated to parsing through text and using
>>> advanced ontologies to determine anything and everything about the
>>> text in a document. They can determine whether a term is a noun,
>>> verb, adjective, and so forth. They can also determine whether
>>> something matches a pattern such as an email address, a street
>>> address, or a company name. The problem is that most of this is not
>>> in the public domain.
>>>
>>> I think most users use regexps to find what they are looking for,
>>> but I am quite sure that using an advanced parsing library would
>>> yield a more robust plugin.
>>>
>>> In my research, I stumbled on LAPIS, a lightweight structure for
>>> text processing that uses advanced technology. LAPIS is open source,
>>> developed by MIT, and is Java based. I used it and was quite
>>> impressed with its ease of use. I think this would be a very
>>> interesting framework to adapt to Nutch. If anyone else knows any
>>> other open source libraries for determining structure, please
>>> comment. You can read more about LAPIS here, or you may google
>>> "lightweight structure text":
>>> http://www.softwaresecretweapons.com/jspwiki/Wiki.jsp?page=Lapis
>>>
>>> I would be willing to help you if you would be willing to put the
>>> plugin into the public domain.
>>>
>>> Here are the 0.7 docs for writing plugins:
>>> http://wiki.apache.org/nutch/WritingPluginExample
>>>
>>> [EMAIL PROTECTED] wrote:
>>>> Hi there,
>>>>
>>>> I've been playing with Nutch for a few weeks now, so I am starting
>>>> to come up with something usable, but I need some suggestions here.
>>>>
>>>> Here's the problem: crawl the web (maybe 50 sites or so) and get
>>>> physical addresses. I want to index physical addresses found on the
>>>> crawl, so my search results should return "Company Name, State" as
>>>> the title; the summary can be whatever is found on that page. [This
>>>> is just an example to simplify what I want to say.]
>>>>
>>>> To index, looking at the Nutch code, it seems I have to parse the
>>>> HTML content and look for the details I need to be searchable. At
>>>> the moment, only things found in META data are indexed, but I want
>>>> to expand this with custom fields, such as company name, state,
>>>> etc.
>>>>
>>>> What's the best way to go about this? I want to write a plugin for
>>>> this. Which classes do I start with, and how do I tackle this?
>>>>
>>>> Thanks
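Richard's point above, that an email address is one of the few things a regular expression captures easily while a company name is not, can be sketched as follows. This is an illustrative snippet, not Nutch code; the class name and the (deliberately simplified) pattern are the editor's, and real-world address grammar (RFC 5322) is far more permissive than this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // A deliberately simple email pattern: local part, "@", domain,
    // dot, and a top-level domain of two or more letters.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Scan free text and collect every substring that looks like an
    // email address, in order of appearance.
    public static List<String> extract(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "Contact [email protected] or [email protected] for details.";
        System.out.println(extract(page));
        // prints [[email protected], [email protected]]
    }
}
```

This is exactly why harvesting emails is easy for spammers: the pattern is regular. No comparable pattern exists for "company name", which is the gap the thread's NLP tools (GATE, UIMA, LAPIS) try to fill.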

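On the closing question of indexing custom fields such as company name and state: a real solution would implement the Nutch plugin interfaces described in the WritingPluginExample docs linked above, but the core idea can be sketched with the standard library alone. Everything below is a hypothetical illustration: the class name, the "Company Inc. - State" pattern (which assumes the data appears in a fixed layout, precisely the regex limitation Richard describes), and the field map standing in for the Lucene Document a plugin would populate:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CustomFieldSketch {
    // Hypothetical pattern: assumes the page carries text like
    // "Acme Widgets Inc. - New York". A robust company-name matcher
    // would need NLP/gazetteer support, as the thread discusses.
    private static final Pattern COMPANY_STATE = Pattern.compile(
        "([A-Z][\\w&. ]+(?:Inc|Ltd|LLC|Corp)\\.?)\\s*-\\s*"
        + "([A-Z][a-z]+(?: [A-Z][a-z]+)?)");

    // Stand-in for the Lucene Document an indexing plugin would fill:
    // field name -> field value.
    public static Map<String, String> extractFields(String pageText) {
        Map<String, String> fields = new HashMap<String, String>();
        Matcher m = COMPANY_STATE.matcher(pageText);
        if (m.find()) {
            fields.put("companyName", m.group(1).trim());
            fields.put("state", m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> f =
            extractFields("Welcome! Acme Widgets Inc. - New York. Est. 1990.");
        System.out.println(f.get("companyName") + " / " + f.get("state"));
        // prints Acme Widgets Inc. / New York
    }
}
```

In an actual plugin, the extracted values would be added to the index as named fields so that searches can target them directly, giving the "Company Name, State" result titles the original post asks for.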