[General] Webboard: tagging or categorizing without crawling again

bar Thu, 11 Dec 2014 21:08:16 -0800

Author: Alexander Barkov
Email: [email protected]
Message:
> Actually, the way of using tag or categories is perfect but, i don't want 
> to crawl again the whole site because i didn't write my tagging rule in 
> the correct way the first time.


This task consists of two parts:

a. Update what you have in the tables "server" and "srvinfo".
This is done automatically whenever indexer starts crawling,
so running "indexer -n0" is enough to do it. Note that this part alone
is sufficient when you just need to rename a tag to a new value.

But usually this is not enough,
as you might want to redistribute documents between tags
(i.e. split a single tag into multiple ones, or join multiple tags
into a single one, or do some more complex redistribution).
In these cases part "b" is also needed.


b. Update the table "url" to refer to the table "server" properly.
There is no special command for this. Normally, documents are
updated properly only when they are next crawled.
But there is a trick: use the "skip" option temporarily
to avoid actually downloading anything.
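
To illustrate why part "a" alone covers a rename but not a split, here is
a minimal sketch in Python with an in-memory SQLite database. The two
tables are a heavily simplified stand-in for the real "server" and "url"
tables (only columns mentioned in this message are kept), and the ids,
URLs and the tag name "manual" are made up for the example.

```python
import sqlite3

# Simplified stand-in schema: only the columns named in this message.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE server (rec_id INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT, server_id INTEGER);
INSERT INTO server VALUES (1, 'doc');
INSERT INTO url VALUES (10, 'http://host/doc/a/1.html', 1);
INSERT INTO url VALUES (11, 'http://host/doc/b/1.html', 1);
""")

def tags_by_url(conn):
    """Map each document URL to its tag through url.server_id."""
    rows = conn.execute(
        "SELECT url.url, server.tag FROM url, server "
        "WHERE url.server_id = server.rec_id ORDER BY url.rec_id"
    )
    return dict(rows.fetchall())

# Part "a" only: renaming a tag touches "server" alone, and every
# document follows because its server_id is unchanged.
db.execute("UPDATE server SET tag = 'manual' WHERE rec_id = 1")
after_rename = tags_by_url(db)

# Splitting: the new tags exist in "server", but the documents still
# point at the old row until "url" is updated too (part "b").
db.execute("INSERT INTO server VALUES (2, 'doca')")
db.execute("INSERT INTO server VALUES (3, 'docb')")
before_relink = tags_by_url(db)

# Hand-written model of the effect of part "b" (not what indexer runs).
db.execute("UPDATE url SET server_id = 2 WHERE url LIKE 'http://host/doc/a/%'")
db.execute("UPDATE url SET server_id = 3 WHERE url LIKE 'http://host/doc/b/%'")
after_relink = tags_by_url(db)

print(after_rename)
print(before_relink)
print(after_relink)
```

The two hand-written UPDATE statements on "url" only model the effect of
part "b"; on a real installation it is indexer that rewrites server_id.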


Suppose you want to split a section of your site
into two subsections and assign different tags to them.

What you do is:

1. Change indexer.conf:

# Remove the old command
Tag doc
Server http://host/doc/


# And add two new commands instead
Tag doca
Server skip http://host/doc/a/

Tag docb
Server skip http://host/doc/b/


Notice the "skip" option in the new commands.


2. Run "indexer -am -u 'http://host/doc/%'"

This will kind of "crawl" all matching documents, but without really
downloading them. All it actually does is execute a query like this
for every document:

UPDATE url SET status=200,next_index_time=1418965297, 
site_id=-1519382294,server_id=-1738492707 WHERE rec_id=259;


3. Remember to remove the "skip" options
from the new "Server" commands in indexer.conf.

4. Check that everything went well:
SELECT server.tag,url.url FROM url,server WHERE url.server_id=server.rec_id;
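
On a large site the full listing can be long, so a per-tag count may be
easier to read. Below is a minimal sketch in Python with an in-memory
SQLite database; the toy tables keep only the columns used by the query
above, and the helper name count_by_tag, the ids and the URLs are made up
for the example. Against a real database you would run the same SELECT
through your own connection.

```python
import sqlite3

def count_by_tag(conn):
    """Count documents per tag using the same join as the check above."""
    rows = conn.execute(
        "SELECT server.tag, COUNT(*) FROM url, server "
        "WHERE url.server_id = server.rec_id "
        "GROUP BY server.tag ORDER BY server.tag"
    )
    return rows.fetchall()

# Toy data standing in for the state after step 2 (hypothetical ids).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE server (rec_id INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT, server_id INTEGER);
INSERT INTO server VALUES (2, 'doca');
INSERT INTO server VALUES (3, 'docb');
INSERT INTO url VALUES (10, 'http://host/doc/a/1.html', 2);
INSERT INTO url VALUES (11, 'http://host/doc/a/2.html', 2);
INSERT INTO url VALUES (12, 'http://host/doc/b/1.html', 3);
""")

counts = count_by_tag(db)
for tag, n in counts:
    print(tag, n)
```

If the old tag still shows up in the counts, some documents were not
matched by the URL pattern given to "indexer -u" in step 2.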




Reply: <http://www.mnogosearch.org/board/message.php?id=21669>

_______________________________________________
General mailing list
[email protected]
http://lists.mnogosearch.org/listinfo/general
