This has been written after the proposal has been updated and the table defintion documented there. i.e

http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags

But I have decided to extent this thread; it pertains to the database data 
model.

I have some questions.

is id a surrogate (sequential or hashed) key
what is entryid? i.e. the article URL? i.e. foreign key to entries
what is name? (The value of the tag?) i.e. foreign key to Tags table

Do I need the definitions of the entries and entries2tags to understand what we 
need to do?

I have come to the conclusion that using surrogate keys is(nearly always) a mistake. The space management advantages are normally negated by the requirement to join against the tables holding the real keys. You normally require an additonal unique index to enforce the uniqueness constraints against both the surrogate & real keys, and will often need to de-optimise the TNF structure to replicate the real data around the database to avoid otherwise unnecessary joins. Maintaining these de-optimisations have both human and perfromance costs.

If you agree that real keys should be used then we need to consider capitalisation & plurals of tags. Easy one first - leave plural management to the authors. I know that I am using blog & blogs seperately, althouhg maybe I should have chosen blogging & blog to be clearer, but the point is that tag authors are not going to think very hard about their tags, this is why the del.icio.us tag prompt scheme is so good. Delicious manages to ignore capitalisation, which I think is good. Its certainly best for query and aggregation for popularity queries., I can't believe that anyone wants to know that both "internet" & "Internet" are popular tags (technorati!)

This implies that we might have two fields to document a tag,

user_tag_name            char(32)
normal_tag_name        char(32)  is proper(user_tag_name)

(sorry for the pseudo-code, but normal* needs to be enforced with a constraint or trigger.)

This then leaves us with the question of indices (indexes?)

The implications of my data model is that the entry_author key should be copied down through the entry tabel into the entry-tags table.

author_name            char(32) is (part of) FK to  entry
entry_name              char(128) is (part of) FK to entry

although, a full roller URL contains the author name, but I assume that the entryname consists of the string used in the entry & anchor queries so I have a table that looks like this (again pseudo code and I have guessed at the database column types).

create definedtags as (
author_name                    char(32)
entry_name (or id?) ,       char(128)
user_tag_name,               char(64)
normal_tag_name,           char(64)
entry_date                         datetime
date_created                    datetime )

create constraint defined_tags_name as
   normal_tag_name=proper(user_tag_name)

create constraint defined_tags_entry_date as
   entry_date = select entry_date from entry
                          where definedtags.entry_name = entry.entry_name

and the following indexes

create dt.tags on definedtags
   as author_name, entry_name, normal_tag_name unique

create dt.tags2 on definedtags
   as normal_tag_name

create dt.entries on definedtags
   as author_name, entry_name

create dt.date_written
   on definedtags as entry_date.

I have written this in a hurry so it may not be though out as well as some of my writing, but hopefully this is collaborative development. Also I have difficulty in commenting on and reading some of the front-end & java orientated stuff. (I have ordered a couple of books to help me catchup). Hopefully this is helpfull



Elias Torres wrote:

Welcome David to the Roller list.

Thank you for your post. I have read your blog post on a tag data
model for Roller. I'm looking forward to your relational algebra and
query cost analysis. I wanted to tell you that we (IBM) have already
added basic tagging support to Roller and it actually supports a
TagCloud. I am supposed to put a proposal in the roller wiki so others
could comment and once I do that, you could put your comments there as
well.

Just to kickstart the conversation I'm including the tagging table we
are currently using.

create table weblogentrytag (
   id              varchar(48)   not null primary key,
   entryid        varchar(48)   not null,
   name            varchar(255)  not null,
   tagtime         timestamp     not null
);

We have basically two tables: entries and entry2tags, but are missing
a tag table. At first, I was very set on having a tag table and use a
foreign key to "save" space on repetitive tag names. But I was shown
it's not really a big space saving technique, especially since tag
names are relatively short storing a guid or int would almost be
comparable in space. There are also increased costs in inserting and
joining on tables to get tag names if using a foreign key, so we have
settle on this for now until we have other queries requirements. I'll
be summarizing all of our changes to roller to support tagging in a
wiki proposal soon.

Regarding the use of the list, some people have been using nabble.com
to interact with it. Maybe you can give it a try. I simply use gmail.

http://www.nabble.com/Roller-f12275.html

Regards,

Elias

On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote:
I have documented a data model for tags. This is held at my blog

http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database

I have a graphic demonstrating the relationship between authors,
articles and tags and illustrating the first and obvious indexes. (I
have identified that both "Date Published" and tag aggregates are
missing from the model).  Since the model was built to help me
understand del.icio.us, I call the entities Users, Bookmarks and Tags,
but hopefully its simple to see that these are pretty synonomous to
authors, articles and tags.

I hope that this is useful for those looking at implementing tags.

I am still working out how to use the mail-list, so I hope that x-refing
you to my blog isn't deprecated. I also need to work how to maintain
thread connections i.e. undertake a reply.
--

Dave

<http://www.sun.com>    * David Levy *
*Sun Microsystems Ltd.*
55, King William St.,
London EC4R 9ND United Kingdom

Phone +44 (0) 20 7469 9908/x18308
Mobile +44 (0) 7710-360922
Blog http://blogs.sun.com/DaveLevy
Email [EMAIL PROTECTED]

Sun Proprietary & Confidential . This e-mail message is for the sole use
of the intended recipient(s) and may contain confidential and
privilidged information. Any unauthorised review, use, disclosure or
distribution is prohibited. If you are not the intended recepient,
please contact the sender by reply e-mail and destroy all copies of the
original message.



--

Dave

<http://www.sun.com>      * David Levy *
*Sun Microsystems Ltd.*
55, King William St.,
London EC4R 9ND United Kingdom

Phone +44 (0) 20 7469 9908/x18308
Mobile +44 (0) 7710-360922
Blog http://blogs.sun.com/DaveLevy
Email [EMAIL PROTECTED]

Sun Proprietary & Confidential . This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privilidged information. Any unauthorised review, use, disclosure or distribution is prohibited. If you are not the intended recepient, please contact the sender by reply e-mail and destroy all copies of the original message.

Reply via email to