aseek-devel  

[aseek-devel] Re: [aseek-users] Ignoring the title, description, and keywords

Kir Kolyshkin
Thu, 07 Nov 2002 18:07:17 -0800

Matt Sullivan wrote:
On Mon, 04 Nov 2002 at 02:54:42 +0300, Kir Kolyshkin wrote:


Francesco Caccavella wrote:

At 19.46 02/11/2002 -0500, you wrote:


Is there a way I can ask searchd to ignore the HTML title and the META
keywords and description when performing a search?

I've configured my system to use tl=off, ds=off, and kw=off but that doesn't
seem to be working.

Thanks!

As I suggested between the lines a few messages ago (http://forum.aspseek.org/index.php?t=msg&goto=1472), in the next release (when? ;-) it would be useful to be able to disable (or simply reduce) the weight of the meta keywords and description in the results (directly in aspseek.conf or searchd.conf). Google ignores these meta tags too.
Probably a better idea would be to implement two "custom"
meta-tags that could be used instead of "Keywords" and
"Description" by users who want them, and to have an additional
option in either searchd.conf or s.htm's "variables" section
to specify the defaults for which fields are to be used.
Matt, what's your opinion?

Yes, I think there would be some benefit in this.

As I see it, this comes back to the "href text" modifications.  I had a
working version of this some months back but it got put on hold while I
re-wrote libaspseek and the PHP module and I haven't come back to it as
yet.  I think that was at version 1.2.6 so I have some work to do to bring
the patch up to date.

I had also started to implement domain name words using a slightly
modified ternary tree of dictionary words to decompose the domain string
into words.
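
A rough sketch of the idea, for readers who have not met the structure. The
"slight modification" mentioned above is not described in this thread, so this
uses a plain ternary search tree, and the fewest-words rule for picking a split
is purely an assumption. All names here (Node, insert, prefix_lengths,
decompose) are hypothetical and not taken from the ASPseek sources.

```python
class Node:
    """One character of a ternary search tree."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False


def insert(node, word, i=0):
    """Insert a non-empty word and return the (possibly new) subtree root."""
    ch = word[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node


def prefix_lengths(root, text, start):
    """Yield the length of every dictionary word beginning at text[start]."""
    node, i = root, start
    while node is not None and i < len(text):
        ch = text[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        else:
            i += 1
            if node.is_word:
                yield i - start
            node = node.eq


def decompose(root, label):
    """Split label into dictionary words (fewest words wins), else None."""
    n = len(label)
    best = [None] * (n + 1)  # best[i] holds the chosen split of label[i:]
    best[n] = []
    for i in range(n - 1, -1, -1):
        for length in prefix_lengths(root, label, i):
            rest = best[i + length]
            if rest is not None and (best[i] is None or len(rest) + 1 < len(best[i])):
                best[i] = [label[i:i + length]] + rest
    return best[0]


# Toy dictionary, for illustration only.
root = None
for w in ["search", "engine", "sea", "arch", "eng", "in"]:
    root = insert(root, w)
```

With that toy dictionary, decompose(root, "searchengine") splits the label into
"search" and "engine", and a label with no full dictionary cover returns None.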

Currently there is no space to support custom tags within the word indexes
(all combinations of the 2 bits assigned for this are used) however with
the href text modifications this will be extended to 4 bits of which the
msb will indicate "Body" leaving 3 usable bits.  Out of this will be
assigned "Title", "Description", "Keywords", "HREF Text" and "Domain Words"
leaving 3 unused slots.
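
The slot arithmetic above can be checked with a small sketch. The numeric codes
given to each section, and the choice that a set msb means "Body", are
assumptions made for illustration; the message does not specify either.

```python
# Hypothetical 4-bit section tag: msb set means Body (assumed polarity),
# the low 3 bits select one of eight non-body slots.
BODY_FLAG = 0b1000

# Assumed numbering; only the section names come from the message.
SECTIONS = {
    "Title": 0b000,
    "Description": 0b001,
    "Keywords": 0b010,
    "HREF Text": 0b011,
    "Domain Words": 0b100,
}


def unused_slots():
    """Return the 3-bit codes not yet assigned to any section."""
    used = set(SECTIONS.values())
    return [code for code in range(8) if code not in used]


def describe(tag):
    """Map a 4-bit tag to a section name."""
    if tag & BODY_FLAG:
        return "Body"
    for name, code in SECTIONS.items():
        if code == tag & 0b0111:
            return name
    return "Unused"
```

Five named sections out of eight 3-bit codes leaves the three spare slots
mentioned above.
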
What about having a variable-length "section" part when splitting
into "section" and "word_position"? I mean, currently we have 2 bits
for "section"; your approach is to have 4 bits.

Taking into account that we need to store large numbers in
"word_position" only for the BODY section (this is the _main_ idea and
assumption), my proposal is to have it this way:

If the first bit (MSB) is 0, this means BODY, and only this one bit is
used for "section" (thus we are adding one extra bit to "word_position"
for the BODY section, which makes some good sense to me).

Otherwise (MSB is 1) the layout can be the following:
6 bits for "section" (the first is always 1, so we can have
2^5 = 32 sections):
100000 - TITLE
100001 - KEYWORDS
100010 - DESCRIPTION
100011 - HREF_TEXT
100100 - URL_TEXT (or "Domain Words")

10xxxx - RESERVED (the 11 remaining combinations; these could
probably be used for, say, MP3 ID3 tags
(Author/Title/Album/Year/Genre/Whatever), or some other cool things
like image search (IMG ALT/TITLE/URL_TEXT/HREF_TEXT/SURROUNDING_TEXT))

11xxxx - CUSTOM DEFINED SECTIONS (up to 16). Actually people
have asked me several times for the ability to
search, say, by Author META, so we can at least
reserve space for it.
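
A minimal sketch of how such a variable-length prefix could be packed and
unpacked. The total field width is not stated in this thread; 16 bits is
assumed here only to make the example concrete, and the pack/unpack helpers
are hypothetical, not existing ASPseek functions.

```python
WORD_BITS = 16              # assumption: total width of section + word_position
PREFIX_BITS = 6             # non-BODY sections use a six-bit prefix
POS_BITS = WORD_BITS - PREFIX_BITS

# Codes exactly as listed in the proposal.
SECTION_CODES = {
    "TITLE": 0b100000,
    "KEYWORDS": 0b100001,
    "DESCRIPTION": 0b100010,
    "HREF_TEXT": 0b100011,
    "URL_TEXT": 0b100100,
}
CODE_TO_NAME = {code: name for name, code in SECTION_CODES.items()}


def pack(section, position):
    """Combine a section name and a word position into one integer."""
    if section == "BODY":
        # MSB stays 0, so every remaining bit carries the position.
        if not 0 <= position < (1 << (WORD_BITS - 1)):
            raise ValueError("position out of range for BODY")
        return position
    if not 0 <= position < (1 << POS_BITS):
        raise ValueError("position out of range for a non-BODY section")
    return (SECTION_CODES[section] << POS_BITS) | position


def unpack(value):
    """Split a packed integer back into (section name, word position)."""
    if value >> (WORD_BITS - 1) == 0:
        return "BODY", value
    code = value >> POS_BITS
    position = value & ((1 << POS_BITS) - 1)
    if code & 0b010000:                     # 11xxxx: custom-defined sections
        return "CUSTOM_%d" % (code & 0b1111), position
    name = CODE_TO_NAME.get(code, "RESERVED_%d" % (code & 0b1111))
    return name, position
```

Under the 16-bit assumption, BODY positions get 15 bits while every other
section gets 10, which is the trade-off the proposal relies on: only BODY is
expected to need large position values.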

Matt, what do you think? To my mind, it looks quite OK, but the code
changes needed could be quite massive (though the same is true for your
4-bit proposal).

PS I'm still ill (and probably on some antibiotic drugs, too), so forgive (and fix) me if I am mistaken somewhere in the above mumblings.

I agree that the order of relevance and the overall weight of these should
be configurable.

I'd like to get the href text additions into the devel tree but I'm not
sure of a timeframe for this yet.

I'm cross posting this to aseek-devel since I think the thread should be
continued there.


Matt.