My approach to tackling structured information is to use DBSight, which
creates Lucene indexes on data retrieved from any database.

As Erik mentioned, scraping is highly fragile. By going directly to the
database, we get data that is more reliable, more up to date, and more
flexible to work with.
On the other hand, you will need database access, and this approach is
quite different from Nutch.
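
To illustrate the database-first idea, here is a minimal sketch in Python.
It is not DBSight's implementation and does not use Lucene: it only shows
rows being read straight from a database and fed into a toy inverted
index. The table, column names, sample rows, and the build_index/search
helpers are all made up for this example.

```python
import re
import sqlite3
from collections import defaultdict


def build_index(conn, query, key_column):
    """Build a toy inverted index (token -> set of row keys) from a SQL query."""
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    index = defaultdict(set)
    for row in conn.execute(query):
        key = row[key_column]
        for column in row.keys():
            if column == key_column or row[column] is None:
                continue
            # Very naive tokenizer: lowercase runs of word characters.
            for token in re.findall(r"\w+", str(row[column]).lower()):
                index[token].add(key)
    return index


def search(index, term):
    """Return the sorted keys of rows containing the term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample data standing in for a real shop database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO products (id, name, description) VALUES (?, ?, ?)",
    [
        (1, "Budget laptop", "15 inch screen, 8 GB RAM"),
        (2, "Desktop tower", "Quiet case, 16 GB RAM"),
        (3, "Gaming laptop", "17 inch screen"),
    ],
)
conn.commit()

index = build_index(conn, "SELECT id, name, description FROM products", "id")
print(search(index, "laptop"))
```

Because the index is rebuilt from a query, it follows the database schema
rather than the layout of any web page, which is where the robustness
comes from.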

Alternatively, Nutch/Lucene could provide a simple XML analyzer that
consumes one specific XML format, with a pluggable XSL stylesheet
transforming any other XML structure into that format.
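
A rough sketch of the consuming side, in Python, assuming the XSL step has
already run and produced a flat docs/doc/field layout. That layout, the
field names, and the parse_docs helper are assumptions made for this
example; nothing like this exists in Nutch or Lucene as far as this
message is concerned.

```python
import xml.etree.ElementTree as ET


def parse_docs(xml_text):
    """Turn <docs><doc id=".."><field name="..">text</field></doc></docs> into dicts."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.findall("doc"):
        fields = {"id": doc.get("id")}
        for field in doc.findall("field"):
            # Empty elements have text None, so fall back to "".
            fields[field.get("name")] = (field.text or "").strip()
        docs.append(fields)
    return docs


# Hypothetical output of an XSL transform over some shop's own XML.
SAMPLE = """<docs>
  <doc id="1">
    <field name="title">Budget laptop</field>
    <field name="price">499.00</field>
  </doc>
  <doc id="2">
    <field name="title">Desktop tower</field>
    <field name="price">799.00</field>
  </doc>
</docs>"""

docs = parse_docs(SAMPLE)
for d in docs:
    print(d)
```

Each resulting dict could then be handed to the indexer as one document,
with one index field per XML field.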

-- 
Chris Lu
---------------------
Full-Text Search on Any Database
http://www.dbsight.net


On 7/26/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> Further on the information extraction idea, consider what the SIMILE
> team at MIT are doing... http://simile.mit.edu
> 
> The lower-case semantic web is gaining a lot of momentum these days,
> and I'm a strong proponent and student of it at the moment.  Scraping
> rich information from a site is certainly reasonably pragmatic, but
> it is also highly fragile.  SIMILE's Piggy Bank has a scraper
> facility.  In a more ideal world, computer shops, book stores,
> libraries, and anyone with data to share would publish it in a
> reusable and structured way (RDF seems to me to be the best way to do
> this).  Merging a full-text search engine with structured
> information, though, is yet another tricky thing that I am myself
> working with at the moment.
> 
> I'd love to have more discussions along these lines.
> 
>      Erik
> 
> 
> On Jul 26, 2005, at 5:50 AM, Cuong Hoang wrote:
> 
> > Hi Jack,
> >
> > I've been doing research the last few days and I think that once
> > successfully implemented, an information extraction system should
> > be able to extract information from various sources. I've started
> > reading about patterns, context-free grammars, and ontologies, which
> > I think will be the core of such a system. I intend to index computer
> > shops.
> >
> > Regards,
> >
> > Cuong Hoang
> >
> > -----Original Message-----
> > From: Jack Tang [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 26 July 2005 6:16 PM
> > To: [email protected]; [email protected]
> > Subject: Re: Information extraction
> >
> > Hi Cuong.
> >
> > I am going to build a private book search engine, and I am facing the
> > same problem.
> > Could you describe more about the information you want to extract and
> > the website?
> >
> > Regards
> > /Jack
> >
> > On 7/26/05, Cuong Hoang <[EMAIL PROTECTED]> wrote:
> >
> >> Hi all,
> >>
> >>
> >>
> >> Does anyone have experience with designing web information
> >> extraction systems such as shopbots/pricebots? I'm currently doing
> >> research on this topic and want to integrate Nutch. A few guidelines
> >> from anyone who has designed this type of system would be really
> >> helpful to me.
> >>
> >>
> >>
> >> Regards,
> >>
> >>
> >>
> >> Cuong Hoang
> >>
> >>
> >>
> >>
> >
> >
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
> 
>

