Sorry to have taken so long.

The denormalisations of the author_name (which may be owner name) and entrydate are to support queries. This is because I expect a macro to create a tag cloud for a user so that the html versions can have the tag cloud,

So

select normal_tag_name, count(*)
from entry2tags
where author_name = "DaveLevy"
group by normal_tag_name

gives us the data required for a tag cloud, for a single blog . No join as you can see, where it gets fun is if you want a hot tags cloud

we add a line so the query becomes

select normal_tag_name, count(*)
from entry2tags
where author_name = "DaveLevy"
and     entry_date > @sevedaysago
group by normal_tag_name


(sorry for the pseudo code, I could do it in Sybase TSQL but its no more helpful than pseudo code). I have again avoided the join and believe that a hot tags cloud should be based upon the entry's date not on a tag created date so as to keep the 'hot' concept based upon what you've written, not what you've tagged. (This may be an important distinction and also a perfromance enhancement when doing a site wide "Hot Tags Cloud")

Further replies are cut into the mail body below, and I have deleted some copies of my signature file.

Elias Torres wrote:

On 1/6/06, Allen Gilliland <[EMAIL PROTECTED]> wrote:
On Fri, 2006-01-06 at 07:05, David Levy wrote:
This has been written after the proposal has been updated and the table 
defintion documented there. i.e

http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags

But I have decided to extent this thread; it pertains to the database data 
model.

I have some questions.

is id a surrogate (sequential or hashed) key
what is entryid? i.e. the article URL? i.e. foreign key to entries
what is name? (The value of the tag?) i.e. foreign key to Tags table
Anil answered these in his email.
Do I need the definitions of the entries and entries2tags to understand what we 
need to do?

I have come to the conclusion that using surrogate keys is(nearly
always) a mistake. The space management advantages are normally negated by the 
requirement to join against the tables holding the real keys. You normally require 
an additonal unique index to enforce the uniqueness constraints against both the 
surrogate & real keys, and will often need to de-optimise the TNF structure to 
replicate the real data around the
database to avoid otherwise unnecessary joins. Maintaining these
de-optimisations have both human and perfromance costs.
I am often of the same opinion, but part of the problem may be that Hibernate 
(our ORM persistence tool) suggests using surrogate keys and probably doesn't 
like multi column primary keys.
Bother, so if we want the real key data on the table it needs to be copied down.


I agree with David. If you see that's exactly why we didn't normalize
our table to create a tag table. We are instead just duplicating the
actual tag because a surrogate key would take up even much more space
than most of the tags. BTW, this is working fine with Hibernate so far,
although, I believe some of the queries were written using HQL (???)
because of their complexity in grouping, etc.
and the tags entity is implemented in an index.


If you agree that real keys should be used then we need to consider
capitalisation & plurals of tags. Easy one first - leave plural
management to the authors. I know that I  am using blog & blogs
seperately, althouhg maybe I should have chosen blogging & blog to be
clearer, but the point is that tag authors are not going to think very
hard about their tags, this is why the del.icio.us tag prompt scheme is
so good. Delicious manages to ignore capitalisation, which I think is
good. Its certainly best for query and aggregation for popularity
queries., I can't believe that anyone wants to know that both "internet"
& "Internet" are popular tags (technorati!)

This implies that we might have two fields to document a tag,

user_tag_name            char(32)
normal_tag_name        char(32)  is proper(user_tag_name)
I agree that plurals should be managed by users and tags should be 
case-insensitive.  My approach to this would be a bit more simplified though, I 
would prefer to simply have the software convert all tags to lowercase after 
user input.  I suppose some people could get a little upset by that, but it 
seems so much easier.

Also, is that "proper()" function part of ANSI sql?  Will it work on all databases or 
have an equivalent on all databases?  I am also confused on how that would work when you are trying 
to determine how many times the tag "java" is applied.  How would you do that?

Actually I don't know if its an SQL standard or not, but I have checked my Postgres manual and it doesn't seem to be there. So lower case it is then. , The code above uses the normal_tag_name to give us the sum of queries java + Java (actually plus jAva etc ). It is the reason why the queries above use the normal tag as the projection list column.


Our current implementation performs a lowercase on all tags (I took that
from del.icio.us. But I'm torn on whether or not to let users manage
plurals. I'll be adding to our internal server a filter using the Porter
Stemming [1] algorithm to see if its results are desirable. I'll report
on the experiment as we get some results.
as you see I have been persauded that lower is right.

I'm not necessarily convinced that the prompt scheme is all that
beneficial. There are some possible consequences that I have studied a
little bit on my TagRank [2] paper. The problem is that care must be
taken to "suggest" tags because if not, the users will reach a local
maximum and never allowing other tags to reach popular/meaningful
levels in the network.
(sorry for the pseudo-code, but normal* needs to be enforced with a
constraint or trigger.)

This then leaves us with the question of indices (indexes?)

The implications of my data model is that the entry_author key should be
copied down through the entry tabel into the entry-tags table.

author_name            char(32) is (part of) FK to  entry
entry_name              char(128) is (part of) FK to entry

although, a full roller URL contains the author name, but I assume that
the entryname consists of the string used in the entry & anchor queries
so I have a table that looks like this (again pseudo code and I have
guessed at the database column types).

create definedtags as (
author_name                    char(32)
entry_name (or id?) ,       char(128)
user_tag_name,               char(64)
normal_tag_name,           char(64)
entry_date                         datetime
date_created                    datetime )
i think this is still pretty similar to what we have now correct?  the main 
difference being that you are tracking which author wrote the specified tag, 
along with what date the tagged entry was published.  i like the idea of having 
the FK to the author included because that would allow us to easily lookup all 
tags used by a specific author.
Yup, see above

is there a reason to copy down the entry_date rather than access it via a join 
on the weblogentry table?  you can join with the weblogentry table using the 
entry_id column.  how were you planning to use the entry_date field?
see above, I don't want to join to the entry table, and this goal also impacts my index design. entry_date allows hot tags queries to be driven by the entry not the tagged date

what acts as the primary key? (author_name, entry_id, user_tag_name)?
my PK is author_name, entry_id and normal_tag_name

we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key.
yeah, looks like it

i'm not sure that would be much of an issue though because it doesn't look like 
you are planning for any joins for the tag names, correct?

I'm trying to avoid any joins, but if you are looking for entries on a blog and tagged, then it would be good to enter the query on author_name, but since we have not copied the title down to the entry2tags table we still need the join and can go in on author_name on either table, but best do it on entry table (see below).

select  entry.title
from    entry, entry2tags e2t
where e2t.entry_id = entry.id
and     entry.author_name = @KnownName
and     e2t.normal_tag_name in (@TagQueryList)


Right, Allen. This is very similar to what we already have. I'm not sure
why having author name here, when we already have that through a join
with the entry_id.

That's right, but I don't want to join, the entry2tags table is big enough without joining.

I'm not sure why do we need entry_date when we have
the tagging date.
I think the queries should be driven through the entry publication date.

I do like the normal tag name. I was thinking this too for the output of the 
Porter Stemming algorithm so I wouldn't lose the original information entered 
by the user. Everything else is the same :-).
I need to read your references to understand this, but I think you agree this is OK

Another point I want to make is the fact, that we do a little bit of
extra processing when saving an existing entry that make sure it keeps
the original date for each tag. For example, when I first created an
entry I tagged it A,B,C. The first time I edited, I removed B. The A and
C tags will retain the dates when they were added as opposed to the edit
date.
Are you holding entry_tag data on the entry table?

Additionally, we have a question of what to do with spaces. Should tags
be multi-word or not? My suggestion to Phay (one of our developers) was
to use spaces as separators in the input field, therefore not supporting
spaces. But we could do multiple things to support spaces, such as
quoting multi-word tags. I believe Flickr supports multi-words but they
remove the spaces from the tags, but technorati does maintain spaces. I
don't like them myself, because I think it fragments the tag space much
more than single words and you could still use intersections to get the
sort of the same result.


create constraint defined_tags_name as
   normal_tag_name=proper(user_tag_name)

create constraint defined_tags_entry_date as
   entry_date = select entry_date from entry
                          where definedtags.entry_name = entry.entry_name

and the following indexes

create dt.tags on definedtags
   as author_name, entry_name, normal_tag_name unique
This is the real primary key

create dt.tags2 on definedtags
   as normal_tag_name
This is the tag entity (or operational master)

create dt.entries on definedtags
   as author_name, entry_name

create dt.date_written
   on definedtags as entry_date.

I have written this in a hurry so it may not be though out as well as
some of my writing, but hopefully this is collaborative development.
Also I have difficulty in commenting on and reading some of the
front-end & java orientated stuff. (I have ordered a couple of books to
help me catchup). Hopefully this is helpfull

I think this all makes sense to me so far and I certainly appreciate the help.  
I think getting the data model correct is a *very* important issue before we 
move forward with implemenatation, so I'm glad we are having this discussion.

-- Allen

[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
[2] http://torrez.us/archives/2005/07/13/tagrank.pdf

Elias Torres wrote:

Welcome David to the Roller list.

Thank you for your post. I have read your blog post on a tag data
model for Roller. I'm looking forward to your relational algebra and
query cost analysis. I wanted to tell you that we (IBM) have already
added basic tagging support to Roller and it actually supports a
TagCloud. I am supposed to put a proposal in the roller wiki so others
could comment and once I do that, you could put your comments there as
well.

Just to kickstart the conversation I'm including the tagging table we
are currently using.

create table weblogentrytag (
  id              varchar(48)   not null primary key,
  entryid        varchar(48)   not null,
  name            varchar(255)  not null,
  tagtime         timestamp     not null
);

We have basically two tables: entries and entry2tags, but are missing
a tag table. At first, I was very set on having a tag table and use a
foreign key to "save" space on repetitive tag names. But I was shown
it's not really a big space saving technique, especially since tag
names are relatively short storing a guid or int would almost be
comparable in space. There are also increased costs in inserting and
joining on tables to get tag names if using a foreign key, so we have
settle on this for now until we have other queries requirements. I'll
be summarizing all of our changes to roller to support tagging in a
wiki proposal soon.

Regarding the use of the list, some people have been using nabble.com
to interact with it. Maybe you can give it a try. I simply use gmail.

http://www.nabble.com/Roller-f12275.html

Regards,

Elias

On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote:


I have documented a data model for tags. This is held at my blog

http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database

I have a graphic demonstrating the relationship between authors,
articles and tags and illustrating the first and obvious indexes. (I
have identified that both "Date Published" and tag aggregates are
missing from the model).  Since the model was built to help me
understand del.icio.us, I call the entities Users, Bookmarks and Tags,
but hopefully its simple to see that these are pretty synonomous to
authors, articles and tags.

I hope that this is useful for those looking at implementing tags.

I am still working out how to use the mail-list, so I hope that x-refing you to 
my blog isn't deprecated. I also need to work how to maintain thread 
connections i.e. undertake a reply.
--

Dave
--

Dave


--

Dave

<http://www.sun.com>      * David Levy *






Blog http://blogs.sun.com/DaveLevy
Email [EMAIL PROTECTED]

Sun Proprietary & Confidential . This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privilidged information. Any unauthorised review, use, disclosure or distribution is prohibited. If you are not the intended recepient, please contact the sender by reply e-mail and destroy all copies of the original message.

Reply via email to