On Sat, Feb 7, 2009 at 3:20 PM, Andrzej Bialecki <[email protected]> wrote:
> (moving this to nutch-user - nutch-agent is for reporting abuse/misbehavior
> of Nutch-based crawlers)
>
> John Crepezzi wrote:
>>
>> I'm interested in writing an application that analyzes sources every time
>> they are updated,
>> and uses the parsedText, tags, title, etc to perform some operations and
>> export the finished data to
>> a database.
>>
>> I'm not sure where this application should be placed within nutch/lucene,
>> so any advice anyone can offer would be greatly appreciated.
>>
>> I thought plugins would work for me, but I'm unable to find an extension
>> point that will give me access
>> to the parsed data and tag sets.
>
> This issue comes up occasionally, but so far no one was desperate enough to
> work out a patch ;)
>
> You can define an additional extension point (please see how eg.
> HtmlParseFilter extension is designed - perhaps this extension is all you
> need?), and invoke this new extension point right after you parse the
> content, so that you can access both the content and the parsed data/text
> even before it's recorded in a segment.
>
> The best place to put this hook would be in ParseUtil class, because that's
> what other Nutch tools use to parse the content.
>
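
To make the extension-point suggestion above a bit more concrete, here is a
rough, self-contained sketch of the pattern in plain Java. It does not use
the real Nutch API: ParsedPage, PostParseHook and ParseHookSketch are
hypothetical names invented for illustration, and the "parser" is a toy
that only pulls out the title and strips tags. The point is just where the
hook gets called: right after parsing, with both the raw content and the
parsed data in hand.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for what a parse yields; not a Nutch class. */
final class ParsedPage {
    final String url;
    final String title;
    final String text;

    ParsedPage(String url, String title, String text) {
        this.url = url;
        this.title = title;
        this.text = text;
    }
}

/** Hypothetical extension point, invoked once per page right after parsing. */
interface PostParseHook {
    void afterParse(String rawContent, ParsedPage parsed);
}

/** Toy stand-in for the central parsing class (the role ParseUtil plays above). */
final class ParseHookSketch {
    private final List<PostParseHook> hooks = new ArrayList<>();

    void register(PostParseHook hook) {
        hooks.add(hook);
    }

    ParsedPage parse(String url, String rawContent) {
        ParsedPage parsed =
            new ParsedPage(url, extractTitle(rawContent), stripTags(rawContent));
        // Each hook sees both the raw content and the parsed data,
        // before anything would be recorded elsewhere.
        for (PostParseHook hook : hooks) {
            hook.afterParse(rawContent, parsed);
        }
        return parsed;
    }

    // Toy title extraction: text between <title> and </title>, or "" if absent.
    private static String extractTitle(String html) {
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            return "";
        }
        return html.substring(start + "<title>".length(), end).trim();
    }

    // Toy text extraction: drop anything that looks like a tag, collapse whitespace.
    private static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        ParseHookSketch util = new ParseHookSketch();
        // The list stands in for a database table; a real hook would do an INSERT.
        List<String> rows = new ArrayList<>();
        util.register((raw, page) -> rows.add(page.url + " | " + page.title));
        util.parse("http://example.com/",
                   "<html><title>Hello</title><body>Some text</body></html>");
        System.out.println(rows);
    }
}
```

In real Nutch code the hook would presumably be loaded through the plugin
system instead of a register() call, and it would receive Nutch's own
content and parse objects; I have left that out because the exact
signatures depend on the Nutch version you are running.
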
Another way to do it is to use the new NutchIndexWriter-s. You can add a new
DBIndexWriter (like SolrIndexWriter or LuceneIndexWriter) then modify the
index process (or add a new Indexer) to push documents to your database.

> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>

--
Doğacan Güney
