Getting other information out of the page requires parsing. In this
case you have to come up with some pretty complicated regular
expressions, unless the information you want, such as the company name,
is in the same place on every site.
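
To make the problem concrete, here is a minimal sketch of the kind of
site-specific regex you end up writing. The HTML structure and the
"company-name" class are made up for illustration; every real site
would need its own pattern.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompanyNameScraper {

    // Hypothetical: assumes the target site wraps the name in
    // <span class="company-name">...</span>.  A different site needs a
    // different pattern, which is exactly the maintenance problem.
    private static final Pattern COMPANY = Pattern.compile(
        "<span\\s+class=\"company-name\">\\s*([^<]+?)\\s*</span>",
        Pattern.CASE_INSENSITIVE);

    public static String extractCompanyName(String html) {
        Matcher m = COMPANY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<div><span class=\"company-name\">Acme Widgets"
                + " Inc.</span> - Boise, Idaho</div>";
        System.out.println(extractCompanyName(page)); // Acme Widgets Inc.
    }
}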

I don't know how to tackle this problem with anything that comes stock
with Nutch, but writing a plugin would be the way to go, especially if
it is in the public domain.

I have thought about developing a similar plugin, but the question
becomes: what do you use?  I view regular expressions as having many
shortcomings.  For instance, they usually amount to a custom solution
for locating a particular piece of information in a particular
structure.  I would like a more robust framework for matching patterns,
one that is easy to use, can be extended, and so forth.  Regular
expressions won't cut it in many cases and don't allow normal users to
write their own.  For example, what is the regular expression for a
company name?  An email address would be an easy one to write a regexp
for, which is why so many spammers use web crawlers to harvest email
addresses from the web.
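
To show the contrast, here is the easy case, using nothing but the
standard java.util.regex classes.  The pattern is deliberately
simplified (nowhere near the full RFC 2822 grammar), but it works for
everyday addresses; there is no equivalent pattern you could write for
"company name".

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailFinder {

    // Simplified email pattern: good enough for harvesting most
    // everyday addresses, not a complete validator.
    private static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static void printEmails(String text) {
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        }
    }

    public static void main(String[] args) {
        printEmails("Contact sales@example.com or support@example.org.");
    }
}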

It turns out there is a whole field of Information Retrieval developing
technologies dedicated to parsing through text and using advanced
ontologies to determine anything and everything about the text in a
document.  They can determine whether a term is a noun, verb,
adjective, and so forth.  They can also determine whether something
matches a pattern such as an email address, a postal address, or a
company name.  The problem is that most of this is not in the public
domain.

I think most users use regexps to find what they are looking for, but I
am quite sure that using an advanced parsing library would yield a more
robust plugin.

In my research, I stumbled on LAPIS, a lightweight structure framework
for text processing that uses advanced technology.  LAPIS is open
source, developed at MIT, and is Java based.  I used it and was quite
impressed with its ease of use.  I think it would be a very interesting
framework to adapt to Nutch.  If anyone knows of other open source
libraries for determining structure, please comment.  You can read more
about LAPIS here, or google "lightweight structure text":
http://www.softwaresecretweapons.com/jspwiki/Wiki.jsp?page=Lapis

I would be willing to help you if you would be willing to put the plugin
into the public domain.

Here are the 0.7 docs for writing plugins:
http://wiki.apache.org/nutch/WritingPluginExample
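
For what it is worth, here is a rough sketch of what the indexing side
of such a plugin might look like.  I am writing this from memory of the
0.7-era IndexingFilter interface shown on that wiki page, so treat the
interface signature, the Lucene Field.Keyword call, and the
extractCompanyName helper as assumptions to check against the actual
example (which, as I recall, also covers the plugin.xml descriptor and
build setup).

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class CompanyIndexingFilter implements IndexingFilter {

    public Document filter(Document doc, Parse parse, FetcherOutput fo)
            throws IndexingException {
        // The extraction step is the real problem discussed above:
        // plug in a regex, LAPIS, or whatever framework is chosen.
        String company = extractCompanyName(parse.getText());
        if (company != null) {
            // Store the value untokenized so it can be searched and
            // displayed as-is in a custom "company" field.
            doc.add(Field.Keyword("company", company));
        }
        return doc;
    }

    private String extractCompanyName(String text) {
        // Placeholder for the actual pattern-matching logic.
        return null;
    }
}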

[EMAIL PROTECTED] wrote:
> hi there,
>
> I've been playing with Nutch for a few weeks now, so I am starting to
> come up with something usable, but I need some suggestions here;
>
> Here's the problem - crawl the web (maybe 50 sites or so) and get
> physical addresses;
>
> I want to index physical addresses found on the crawl, so my search
> results should return "Company Name, State" as the Title; the Summary
> can be whatever is found on that page. [this is just an example to
> simplify what I want to say]
>
> To index, looking at the Nutch code, it seems I have to parse the HTML
> content and look for the details I need to be searchable. At the
> moment only things found in META data are indexed, but I want to
> expand this with custom fields, such as company name, state, etc.
>
> What's the best way to go about this? I want to write a plugin for
> this; which classes do I start with, and how do I tackle this?
>
> Thanks
>

