This has been written after the proposal has been updated and the table
defintion documented there. i.e
http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags
But I have decided to extent this thread; it pertains to the database data
model.
I have some questions.
is id a surrogate (sequential or hashed) key
what is entryid? i.e. the article URL? i.e. foreign key to entries
what is name? (The value of the tag?) i.e. foreign key to Tags table
Do I need the definitions of the entries and entries2tags to understand what we
need to do?
I have come to the conclusion that using surrogate keys is(nearly
always) a mistake. The space management advantages are normally negated
by the requirement to join against the tables holding the real keys. You
normally require an additonal unique index to enforce the uniqueness
constraints against both the surrogate & real keys, and will often need
to de-optimise the TNF structure to replicate the real data around the
database to avoid otherwise unnecessary joins. Maintaining these
de-optimisations have both human and perfromance costs.
If you agree that real keys should be used then we need to consider
capitalisation & plurals of tags. Easy one first - leave plural
management to the authors. I know that I am using blog & blogs
seperately, althouhg maybe I should have chosen blogging & blog to be
clearer, but the point is that tag authors are not going to think very
hard about their tags, this is why the del.icio.us tag prompt scheme is
so good. Delicious manages to ignore capitalisation, which I think is
good. Its certainly best for query and aggregation for popularity
queries., I can't believe that anyone wants to know that both "internet"
& "Internet" are popular tags (technorati!)
This implies that we might have two fields to document a tag,
user_tag_name char(32)
normal_tag_name char(32) is proper(user_tag_name)
(sorry for the pseudo-code, but normal* needs to be enforced with a
constraint or trigger.)
This then leaves us with the question of indices (indexes?)
The implications of my data model is that the entry_author key should be
copied down through the entry tabel into the entry-tags table.
author_name char(32) is (part of) FK to entry
entry_name char(128) is (part of) FK to entry
although, a full roller URL contains the author name, but I assume that
the entryname consists of the string used in the entry & anchor queries
so I have a table that looks like this (again pseudo code and I have
guessed at the database column types).
create definedtags as (
author_name char(32)
entry_name (or id?) , char(128)
user_tag_name, char(64)
normal_tag_name, char(64)
entry_date datetime
date_created datetime )
create constraint defined_tags_name as
normal_tag_name=proper(user_tag_name)
create constraint defined_tags_entry_date as
entry_date = select entry_date from entry
where definedtags.entry_name = entry.entry_name
and the following indexes
create dt.tags on definedtags
as author_name, entry_name, normal_tag_name unique
create dt.tags2 on definedtags
as normal_tag_name
create dt.entries on definedtags
as author_name, entry_name
create dt.date_written
on definedtags as entry_date.
I have written this in a hurry so it may not be though out as well as
some of my writing, but hopefully this is collaborative development.
Also I have difficulty in commenting on and reading some of the
front-end & java orientated stuff. (I have ordered a couple of books to
help me catchup). Hopefully this is helpfull
Elias Torres wrote:
Welcome David to the Roller list.
Thank you for your post. I have read your blog post on a tag data
model for Roller. I'm looking forward to your relational algebra and
query cost analysis. I wanted to tell you that we (IBM) have already
added basic tagging support to Roller and it actually supports a
TagCloud. I am supposed to put a proposal in the roller wiki so others
could comment and once I do that, you could put your comments there as
well.
Just to kickstart the conversation I'm including the tagging table we
are currently using.
create table weblogentrytag (
id varchar(48) not null primary key,
entryid varchar(48) not null,
name varchar(255) not null,
tagtime timestamp not null
);
We have basically two tables: entries and entry2tags, but are missing
a tag table. At first, I was very set on having a tag table and use a
foreign key to "save" space on repetitive tag names. But I was shown
it's not really a big space saving technique, especially since tag
names are relatively short storing a guid or int would almost be
comparable in space. There are also increased costs in inserting and
joining on tables to get tag names if using a foreign key, so we have
settle on this for now until we have other queries requirements. I'll
be summarizing all of our changes to roller to support tagging in a
wiki proposal soon.
Regarding the use of the list, some people have been using nabble.com
to interact with it. Maybe you can give it a try. I simply use gmail.
http://www.nabble.com/Roller-f12275.html
Regards,
Elias
On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote:
I have documented a data model for tags. This is held at my blog
http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
I have a graphic demonstrating the relationship between authors,
articles and tags and illustrating the first and obvious indexes. (I
have identified that both "Date Published" and tag aggregates are
missing from the model). Since the model was built to help me
understand del.icio.us, I call the entities Users, Bookmarks and Tags,
but hopefully its simple to see that these are pretty synonomous to
authors, articles and tags.
I hope that this is useful for those looking at implementing tags.
I am still working out how to use the mail-list, so I hope that x-refing
you to my blog isn't deprecated. I also need to work how to maintain
thread connections i.e. undertake a reply.
--
Dave
<http://www.sun.com> * David Levy *
*Sun Microsystems Ltd.*
55, King William St.,
London EC4R 9ND United Kingdom
Phone +44 (0) 20 7469 9908/x18308
Mobile +44 (0) 7710-360922
Blog http://blogs.sun.com/DaveLevy
Email [EMAIL PROTECTED]
Sun Proprietary & Confidential . This e-mail message is for the sole use
of the intended recipient(s) and may contain confidential and
privilidged information. Any unauthorised review, use, disclosure or
distribution is prohibited. If you are not the intended recepient,
please contact the sender by reply e-mail and destroy all copies of the
original message.
--
Dave
<http://www.sun.com> * David Levy *
*Sun Microsystems Ltd.*
55, King William St.,
London EC4R 9ND United Kingdom
Phone +44 (0) 20 7469 9908/x18308
Mobile +44 (0) 7710-360922
Blog http://blogs.sun.com/DaveLevy
Email [EMAIL PROTECTED]
Sun Proprietary & Confidential . This e-mail message is for the sole use
of the intended recipient(s) and may contain confidential and
privilidged information. Any unauthorised review, use, disclosure or
distribution is prohibited. If you are not the intended recepient,
please contact the sender by reply e-mail and destroy all copies of the
original message.