Iain,

Thanks for the pointer to GATE. I will take a look at it too.

Richard
Fadzi Ushewokunze wrote:
> Iain,
>
> Ah, thanks for that. I am actually playing with it right now.
> Are you using it?
>
> ----- Original Message -----
> From: "Iain" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Sunday, September 24, 2006 6:26 PM
> Subject: RE: crawl/index/search
>
>> You might want to check out GATE from Sheffield University. It's like
>> UIMA in concept, but more mature and probably richer.
>>
>> They've got a number of modules which integrate with Lucene, so
>> integration with Nutch should be easier.
>>
>> Iain
>>
>> ---------------
>> Iain Downs (Microsoft MVP)
>> Commercial Software Therapist
>> E: [EMAIL PROTECTED]  T: +44 (0) 1423 872988
>> W: www.idcl.co.uk
>> http://mvp.support.microsoft.com
>>
>> -----Original Message-----
>> From: Fadzi Ushewokunze [mailto:[EMAIL PROTECTED]]
>> Sent: 24 September 2006 04:03
>> To: [email protected]
>> Subject: Re: crawl/index/search
>>
>> Richard,
>>
>> Thanks for the insight.
>>
>> I have spent the past few days looking at lightweight structured
>> text, text mining, and eventually Natural Language Processing.
>> Through further research I came across UIMA from IBM - I liked the
>> idea behind it. I played around with it, but it is a huge monster!
>>
>> It's still new to me, so I am still getting my head around it, but I
>> think it has the potential to achieve a lot. Have you ever dealt with
>> it?
>>
>> Or, for that matter, if anyone in the community has, it would be nice
>> to get some info on this, especially if you have integrated it with
>> Nutch/Lucene.
>>
>> It seems UIMA will be in the Apache Incubator - it also has a
>> decent-sized community behind it already.
>>
>> Anyway, this is a whole new world (NLP, structured text, etc.) for me
>> at the moment, so I am still evaluating my requirements and what
>> tools are available.
>>
>> ----- Original Message -----
>> From: "Richard Braman" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Wednesday, September 20, 2006 12:55 PM
>> Subject: Re: crawl/index/search
>>
>>> Getting other information out of the page requires parsing. In this
>>> case you have to come up with some pretty complicated regular
>>> expressions, unless the information you want, like the company name,
>>> is going to be in the same place on each site.
>>>
>>> I don't know how to tackle this problem with anything that comes
>>> stock with Nutch, but writing a plugin would be the way to go,
>>> especially if it is in the public domain.
>>>
>>> I have thought about developing a similar plugin, but the question
>>> becomes: what do you use? I view regular expressions as having many
>>> shortcomings. For instance, they usually only apply as a custom
>>> solution for locating a particular piece of information in a
>>> particular structure. I would like a more robust framework for
>>> matching patterns that is easy to use, can be extended, and so
>>> forth. Regular expressions won't cut it in many cases, and they
>>> don't allow normal users to write their own. For example, what is a
>>> regular expression for a company name? An email address would be an
>>> easy one to make a regexp for, which is why so many spammers use web
>>> crawlers to harvest email addresses from the web.
>>>
>>> It turns out there is a whole field of Information Retrieval
>>> developing technologies dedicated to parsing through text and using
>>> advanced ontologies to determine anything and everything about the
>>> text in a document. They can determine whether a term is a noun,
>>> verb, adjective, and so forth. They can also determine whether
>>> something matches a pattern such as an email address, a street
>>> address, or a company name. The problem is that most of this is not
>>> in the public domain.
>>>
>>> I think most users use regexps to find what they are looking for,
>>> but I am quite sure that using an advanced parsing library would
>>> yield a more robust plugin.
>>>
>>> In my research, I stumbled on LAPIS, a lightweight structure for
>>> text processing that uses advanced technology. LAPIS is open source,
>>> developed by MIT, and is Java based. I used it and was quite
>>> impressed with its ease of use. I think this would be a very
>>> interesting framework to adapt to Nutch. If anyone else knows any
>>> other open source libraries for determining structure, please
>>> comment. You can read more about LAPIS here, or you may google
>>> "lightweight structure text":
>>> http://www.softwaresecretweapons.com/jspwiki/Wiki.jsp?page=Lapis
>>>
>>> I would be willing to help you if you would be willing to put the
>>> plugin into the public domain.
>>>
>>> Here are the 0.7 docs for writing plugins:
>>> http://wiki.apache.org/nutch/WritingPluginExample
>>>
>>> [EMAIL PROTECTED] wrote:
>>>> Hi there,
>>>>
>>>> I've been playing with Nutch for a few weeks now, so I am starting
>>>> to come up with something usable, but I need some suggestions here.
>>>>
>>>> Here's the problem: crawl the web (maybe 50 sites or so) and get
>>>> physical addresses. I want to index physical addresses found on the
>>>> crawl, so my search results should return "Company Name, State" as
>>>> the title; the summary can be whatever is found on that page. [This
>>>> is just an example to simplify what I want to say.]
>>>>
>>>> To index, looking at the Nutch code, it seems I have to parse the
>>>> HTML content and look for the details I need to be searchable. At
>>>> the moment, only things found in META data are indexed, but I want
>>>> to expand this with custom fields, such as company name, state,
>>>> etc.
>>>>
>>>> What's the best way to go about this? I want to write a plugin for
>>>> this. Which classes do I start with, and how do I tackle this?
>>>>
>>>> Thanks
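Richard's point above, that an email address is one of the few things a regular expression captures easily while a company name is not, can be sketched as follows. This is an illustrative snippet, not Nutch code; the class name and the (deliberately simplified) pattern are the editor's, and real-world address grammar (RFC 5322) is far more permissive than this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // A deliberately simple email pattern: local part, "@", domain,
    // dot, and a top-level domain of two or more letters.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Scan free text and collect every substring that looks like an
    // email address, in order of appearance.
    public static List<String> extract(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "Contact [email protected] or [email protected] for details.";
        System.out.println(extract(page));
        // prints [[email protected], [email protected]]
    }
}
```

This is exactly why harvesting emails is easy for spammers: the pattern is regular. No comparable pattern exists for "company name", which is the gap the thread's NLP tools (GATE, UIMA, LAPIS) try to fill.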

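On the closing question of indexing custom fields such as company name and state: a real solution would implement the Nutch plugin interfaces described in the WritingPluginExample docs linked above, but the core idea can be sketched with the standard library alone. Everything below is a hypothetical illustration: the class name, the "Company Inc. - State" pattern (which assumes the data appears in a fixed layout, precisely the regex limitation Richard describes), and the field map standing in for the Lucene Document a plugin would populate:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CustomFieldSketch {
    // Hypothetical pattern: assumes the page carries text like
    // "Acme Widgets Inc. - New York". A robust company-name matcher
    // would need NLP/gazetteer support, as the thread discusses.
    private static final Pattern COMPANY_STATE = Pattern.compile(
        "([A-Z][\\w&. ]+(?:Inc|Ltd|LLC|Corp)\\.?)\\s*-\\s*"
        + "([A-Z][a-z]+(?: [A-Z][a-z]+)?)");

    // Stand-in for the Lucene Document an indexing plugin would fill:
    // field name -> field value.
    public static Map<String, String> extractFields(String pageText) {
        Map<String, String> fields = new HashMap<String, String>();
        Matcher m = COMPANY_STATE.matcher(pageText);
        if (m.find()) {
            fields.put("companyName", m.group(1).trim());
            fields.put("state", m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> f =
            extractFields("Welcome! Acme Widgets Inc. - New York. Est. 1990.");
        System.out.println(f.get("companyName") + " / " + f.get("state"));
        // prints Acme Widgets Inc. / New York
    }
}
```

In an actual plugin, the extracted values would be added to the index as named fields so that searches can target them directly, giving the "Company Name, State" result titles the original post asks for.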