***I might have posted this already; my mail server is playing up. Apologies
if so.***

Hi there,

I've been playing with Nutch for a few weeks now and am starting to put
together something usable, but I need some suggestions.

Here's the problem: crawl the web (maybe 50 sites or so) and extract physical
addresses.

I want to index the physical addresses found during the crawl, so that each
search result shows "Company Name, State" as the title; the summary can be
whatever is found on that page.
[This is just an example to simplify what I want to say.]

Looking at the Nutch code, it seems that to index this I have to parse the
HTML content and look for the details I need to be searchable. At the moment
only what is found in the META data is indexed, but I want to expand this with
custom fields such as company name, state, etc.
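
To make the extraction step concrete, here is a rough plain-Java sketch of what
I have in mind. It is not tied to any Nutch API. The class and method names
(AddressFieldSketch, stripTags, findState, buildTitle) are made up for
illustration, the tag stripping is deliberately naive, and it assumes US-style
addresses that end in a two-letter state code followed by a ZIP code. The
company name is passed in by hand here; where it would really come from is
part of what I'm asking.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressFieldSketch {
    // Assumption: addresses end in "ST 12345" or "ST 12345-6789" (US style).
    private static final Pattern STATE_ZIP =
        Pattern.compile("\\b([A-Z]{2})\\s+(\\d{5})(?:-\\d{4})?\\b");

    // Naive tag stripper, for illustration only; a real HTML parser
    // would be needed for anything beyond trivial markup.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Returns the first two-letter code that is followed by a ZIP code, if any.
    static Optional<String> findState(String text) {
        Matcher m = STATE_ZIP.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Builds the "Company Name, State" title I want in the search results.
    static String buildTitle(String company, String state) {
        return company + ", " + state;
    }

    public static void main(String[] args) {
        String html = "<p>Acme Widgets<br>12 Main St, Springfield, IL 62701</p>";
        String text = stripTags(html);
        // Fall back to the bare company name when no state is found.
        String title = findState(text)
            .map(state -> buildTitle("Acme Widgets", state))
            .orElse("Acme Widgets");
        System.out.println(title);
    }
}
```

The idea would be to run something like findState over the parsed page text
and store the result as a custom field alongside the existing META data.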

What's the best way to go about this? I want to write a plugin for it. Which
classes do I start with, and how do I tackle this?

Thanks 

Fadzi
