Getting other information out of the page requires parsing. Unless the information you want, like the company name, is in the same place on every site, you will have to come up with some pretty complicated regular expressions.
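
To make the "same place on each site" case concrete, here is a minimal sketch in plain Java. It assumes a made-up convention where every page title looks like "Company Name - something else"; the class name, the method name, and the " - " separator are all hypothetical, and nothing here uses the Nutch API.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleFieldExtractor {
    // Assumption: the company name is the part of <title> before the first " - ".
    private static final Pattern TITLE = Pattern.compile(
            "<title>\\s*(.*?)\\s*</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text before the first " - " in the page title, if a title exists. */
    public static Optional<String> companyName(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        String title = m.group(1);
        int sep = title.indexOf(" - ");
        String name = (sep >= 0 ? title.substring(0, sep) : title).trim();
        return name.isEmpty() ? Optional.empty() : Optional.of(name);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Acme Widgets - Contact Us</title></head></html>";
        System.out.println(companyName(html).orElse("(not found)"));
    }
}
```

This only works while every site follows the same convention. As soon as one site puts the name somewhere else, you need another pattern, which is the problem described above.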

I don't know how to tackle this problem with anything that comes stock with Nutch, but writing a plugin would be the way to go, especially if it is in the public domain. I have thought about developing a similar plugin, but the question becomes: what do you use?

I view regular expressions as having many shortcomings. For instance, they usually only work as a custom solution for locating a particular piece of information in a particular structure. I would like a more robust framework for matching patterns, one that is easy to use, can be extended, and so forth. Regular expressions won't cut it in many cases, and they don't allow normal users to write their own patterns. For example, what is a regular expression for a company name? An email address would be an easy one to write a regular expression for, which is why so many spammers use web crawlers to harvest email addresses from the web.

It turns out there is a whole field of Information Retrieval developing technologies dedicated to parsing text and using advanced ontologies to determine anything and everything about the text in a document. They can determine whether a term is a noun, verb, adjective, and so forth. They can also determine whether something matches a pattern such as an email address, a physical address, or a company name. The problem is that most of this is not in the public domain. I think most users use regexps to find what they are looking for, but I am quite sure that using an advanced parsing library would yield a more robust plugin.

In my research, I stumbled on LAPIS, a lightweight structure for text processing that uses advanced technology. LAPIS is open source, developed at MIT, and Java based. I used it and was quite impressed with its ease of use. I think this would be a very interesting framework to adapt to Nutch. If anyone knows of any other open source libraries for determining structure, please comment.
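
As an illustration of the email versus company name contrast, here is a rough Java sketch of the kind of pattern that is easy to write. Both regular expressions are deliberately simplified assumptions (the email one is far looser than the real address grammar, and the second assumes a US-style two-letter state followed by a ZIP code); the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExamples {
    // Simplified email pattern; real addresses allow more than this.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Assumed US-style tail of an address: two capital letters, then a ZIP code.
    private static final Pattern STATE_ZIP = Pattern.compile(
            "\\b[A-Z]{2}\\s+\\d{5}(?:-\\d{4})?\\b");

    // Collects every non-overlapping match of the pattern, in order.
    private static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static List<String> emails(String text) {
        return findAll(EMAIL, text);
    }

    public static List<String> stateZips(String text) {
        return findAll(STATE_ZIP, text);
    }

    public static void main(String[] args) {
        String text = "Contact sales@example.com or visit 12 Main St, Springfield, IL 62704.";
        System.out.println(emails(text));
        System.out.println(stateZips(text));
    }
}
```

There is no comparable one-line pattern for "company name", since a name has no fixed shape to match on. That is the gap a structure-aware library would have to fill.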

You can read more about LAPIS here, or google "lightweight structure text":
http://www.softwaresecretweapons.com/jspwiki/Wiki.jsp?page=Lapis

I would be willing to help you if you would be willing to put the plugin into the public domain. Here are the 0.7 docs for writing plugins:
http://wiki.apache.org/nutch/WritingPluginExample

[EMAIL PROTECTED] wrote:
> hi there,
>
> Been playing with Nutch for a few weeks now, so i am starting on
> coming up something usable but i need some suggestions here;
>
> Heres the problem - crawl the web (maybe 50 sites or so) and get
> physical addresses;
>
> i want to index physical addresses found on the crawl, so my search
> results should return "Company Name, State" as the Title, the Summary
> can be what ever is found on that page. [this is just an example to
> simplify what i want to say]
>
> To index, looking at the Nutch code, seems i have to parse the HTML
> content and look for the details I need to be searchable.. at the
> moment only things found in META data is indexed but i want to expand
> this with custom fields, such as company name, state etc..
>
> Whats the best way to go about this? I want to write a plug in for
> this; Which classes do i start with and how do i tackle this?
>
> Thanks