Re: Skipping Root File from Indexing

Markus Jelsma Tue, 24 Apr 2012 15:05:00 -0700

Hi,

With Nutch 1.5 or lower you have two options: either skip the record in a simple custom indexing filter, or skip it in a custom UpdateProcessor in Solr. You can also try NUTCH-1300; it has a patch that enables filtering and normalizing for the indexer job. With the RegexURLFilter enabled and filter rules written specifically for the indexer, you can skip the URL. It hasn't been fully tested, but it should work. It's likely to be committed to trunk in the next month.


https://issues.apache.org/jira/browse/NUTCH-1300
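
To make the filter-rule idea concrete, here is a rough stand-alone sketch of how rules in the "+regex" / "-regex" style used by the RegexURLFilter could exclude only the root page. It does not use any Nutch classes: the class name IndexerUrlRuleSketch and the host www.example.com are made up for illustration, and the behaviour it assumes (first matching rule wins, a URL matching no rule is rejected) is my understanding of the filter, not something taken from the NUTCH-1300 patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; it does not use any Nutch classes.
class IndexerUrlRuleSketch {

    // One rule: a sign (+ accept, - reject) and a regular expression.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses rule lines; blank lines and lines starting with # are ignored.
    IndexerUrlRuleSketch(String... lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // The first matching rule decides; a URL that matches no rule is rejected.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        IndexerUrlRuleSketch filter = new IndexerUrlRuleSketch(
            "# skip only the root page (www.example.com is a placeholder host)",
            "-^http://www\\.example\\.com/(index\\.html)?$",
            "# accept everything else",
            "+.");
        System.out.println(filter.accepts("http://www.example.com/index.html"));      // false
        System.out.println(filter.accepts("http://www.example.com/"));                // false
        System.out.println(filter.accepts("http://www.example.com/docs/page1.html")); // true
    }
}
```

The reject rule is anchored with ^ and $ so that only the root URL is dropped; an index.html deeper in the site would still be accepted by the final "+." rule. Such rules would apply to the indexer only, so the crawl can still fetch the root page and follow its links.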

Cheers

On Tue, 24 Apr 2012 12:44:56 -0700 (PDT), atul <[email protected]> wrote:

Hi,

We have a Nutch-Solr combination in place to build up a web page.
We are reading a source index.html file which contains links to other web
pages.
Our code is working fine; we are getting the rest of the web pages indexed
by following the URLs on index.html.
However, we don't want the index.html file itself to get indexed in Solr. We want
all the internal URLs (web pages)
indexed except the root page.

Please advise how this can be achieved.

Thanks,
Atul

--
View this message in context:

http://lucene.472066.n3.nabble.com/Skipping-Root-File-from-Indexing-tp3936375p3936375.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
