Michael Böckling wrote:
What you should do is compare the structure Nutch uses with the structure you use, and somehow combine the two. For most fields, you should converge on the Nutch version. Other than that, once the index is created by Nutch, it is plain Lucene. You can merge the indexes, run a MultiSearcher, or open separate DistributedSearch$Clients and combine the results from the separate indexes on the fly. However, there is an issue with summaries. Do you intend to use them?
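
To make "combine the results on the fly" a bit more concrete, here is a minimal sketch that interleaves two already-ranked hit lists by score. It deliberately uses only the JDK, not the Lucene or Nutch API: the `HitMerger` and `Hit` names and the `source`/`id`/`score` fields are hypothetical. It is written with generics for a current JDK, so Java 1.4.2 would need those removed. One caveat worth stating: raw scores from two independent indexes are computed from different term statistics, so comparing them directly is only a rough approximation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene or Nutch.
class HitMerger {

    /** A single search hit: which index it came from, its id there, and its score. */
    static final class Hit {
        final String source;
        final String id;
        final float score;

        Hit(String source, String id, float score) {
            this.source = source;
            this.id = id;
            this.score = score;
        }
    }

    /** Combines two hit lists and returns the topN hits, highest score first. */
    static List<Hit> merge(List<Hit> first, List<Hit> second, int topN) {
        List<Hit> all = new ArrayList<Hit>(first);
        all.addAll(second);
        // Collections.sort is stable, so hits with equal scores keep their input order.
        Collections.sort(all, new Comparator<Hit>() {
            public int compare(Hit x, Hit y) {
                return Float.compare(y.score, x.score);
            }
        });
        return new ArrayList<Hit>(all.subList(0, Math.min(topN, all.size())));
    }
}
```

If the two result sets are displayed in separate sections instead, no merging of this kind is needed: each index is simply queried on its own.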

I see. I don't think I can unify the index fields, since we use a very
granular field structure for our DB content. It would be ok to have the
results displayed on the web page separated, with the first paragraph
showing the DB search results and the second one for the Nutch results,
effectively running and querying the two indexes separately.
Then it is simpler to use Lucene and Nutch together.
Further issues:
- Are Lucene and Nutch Queries compatible? I've heard the "Query" class
hierarchy is different for Nutch. Basically, a query that works for Lucene
(maybe containing boolean operators, phrases etc.) should not throw an
exception or so in Nutch and return sensible results.
Yes, Nutch uses a Query class different from Lucene's, and the query is also parsed differently. Basically, Nutch parses the query with Query.parse, then runs all the query plugins, which convert the Nutch query into a Lucene boolean query. This Lucene
query is then sent to the index servers, which use Lucene's searchers.
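
The parse-then-translate flow described above can be illustrated with a small standalone sketch. This is not Nutch's code and calls no Nutch or Lucene classes: `QueryTranslationSketch`, `Clause`, and the field names passed in are assumptions made for illustration. It assumes a simple query syntax in which every term is required unless prefixed with `-` and double quotes mark a phrase, then expands each clause across several index fields and prints the result in Lucene query-string notation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "parse, then expand into a boolean query".
class QueryTranslationSketch {

    /** One parsed clause: a term or phrase, optionally prohibited. */
    static final class Clause {
        final String text;
        final boolean phrase;
        final boolean prohibited;

        Clause(String text, boolean phrase, boolean prohibited) {
            this.text = text;
            this.phrase = phrase;
            this.prohibited = prohibited;
        }
    }

    /** Splits the query into terms, "quoted phrases" and -prohibited clauses. */
    static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<Clause>();
        int i = 0;
        int n = query.length();
        while (i < n) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            boolean prohibited = (c == '-');
            if (prohibited) i++;
            boolean phrase = i < n && query.charAt(i) == '"';
            int start = phrase ? i + 1 : i;
            int end = start;
            // A phrase runs to the closing quote, a term to the next whitespace.
            while (end < n && (phrase ? query.charAt(end) != '"'
                                      : !Character.isWhitespace(query.charAt(end)))) {
                end++;
            }
            String text = query.substring(start, end).trim().toLowerCase();
            if (text.length() > 0) clauses.add(new Clause(text, phrase, prohibited));
            i = phrase ? end + 1 : end;
        }
        return clauses;
    }

    /** Expands every clause over all fields, as a required or prohibited group. */
    static String toLuceneSyntax(List<Clause> clauses, String[] fields) {
        StringBuilder out = new StringBuilder();
        for (Clause c : clauses) {
            if (out.length() > 0) out.append(' ');
            out.append(c.prohibited ? "-(" : "+(");
            for (int f = 0; f < fields.length; f++) {
                if (f > 0) out.append(' ');
                out.append(fields[f]).append(':');
                if (c.phrase) out.append('"').append(c.text).append('"');
                else out.append(c.text);
            }
            out.append(')');
        }
        return out.toString();
    }
}
```

For the input `nutch "search engine" -spam` with the fields `url` and `content`, this sketch yields `+(url:nutch content:nutch) +(url:"search engine" content:"search engine") -(url:spam content:spam)`. Note that the sketch treats words such as `AND` or `OR` as ordinary terms, so a query written for Lucene's own parser will not throw here but may not mean the same thing; how a given Nutch version handles them should be checked against its parser.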

- I need to exclude things like header, footer and navigation from the
crawled pages and only index the content of a certain area. Can this be done
in Nutch? I found some vague hints pointing to HtmlParser and Plugins...
Yes, you can write an HTML plugin that parses only the desired content.
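
As a rough illustration of "only parse desired content", the sketch below walks a DOM tree, finds the element with a given `id`, and returns only its text. It uses the JDK's XML parser, so it assumes well-formed XHTML input; inside Nutch the DOM would come from the HTML parser, and the wiring into a parse plugin is not shown. The class name `ContentAreaExtractor` and the `id` value `content` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch plugin.
class ContentAreaExtractor {

    /** Depth-first search for the first element whose id attribute matches. */
    static Element findById(Node node, String id) {
        if (node instanceof Element) {
            Element e = (Element) node;
            if (id.equals(e.getAttribute("id"))) return e;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Element hit = findById(c, id);
            if (hit != null) return hit;
        }
        return null;
    }

    /** Returns the whitespace-normalised text of the element with the given id, or "". */
    static String extract(String xhtml, String id) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Element area = findById(doc, id);
            return area == null ? "" : area.getTextContent().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            // Input that is not well-formed XML ends up here in this sketch.
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }
}
```

Header, footer and navigation elements are skipped simply because they sit outside the selected element; a page without that element produces empty text.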
- My working environment for the current search is Java 1.4.2 and Lucene
2.1. I guess I have to use Nutch 0.8 (since 0.9 switched to Java 1.5) and
hope it can cope with the newer Lucene version?

Nutch 0.9 uses Lucene 2.1.
Thanks a lot for your help so far!


Welcome :)
Regards,

Michael Böckling


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
