Re: About Tags, a proposed data-model

Allen Gilliland Fri, 06 Jan 2006 10:06:05 -0800

On Fri, 2006-01-06 at 07:05, David Levy wrote:
> This has been written after the proposal has been updated and the table 
> defintion documented there. i.e
> 
> http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags
> 
> But I have decided to extent this thread; it pertains to the database data 
> model.
> 
> I have some questions.
> 
> is id a surrogate (sequential or hashed) key
> what is entryid? i.e. the article URL? i.e. foreign key to entries
> what is name? (The value of the tag?) i.e. foreign key to Tags table


Anil answered these in his email.

> 
> Do I need the definitions of the entries and entries2tags to understand what 
> we need to do?
> 
> I have come to the conclusion that using surrogate keys is(nearly 
> always) a mistake. The space management advantages are normally negated 
> by the requirement to join against the tables holding the real keys. You 
> normally require an additonal unique index to enforce the uniqueness 
> constraints against both the surrogate & real keys, and will often need 
> to de-optimise the TNF structure to replicate the real data around the 
> database to avoid otherwise unnecessary joins. Maintaining these 
> de-optimisations have both human and perfromance costs.

I am often of the same opinion, but part of the problem may be that Hibernate 
(our ORM persistence tool) suggests using surrogate keys and probably doesn't 
like multi column primary keys.

> 
> If you agree that real keys should be used then we need to consider 
> capitalisation & plurals of tags. Easy one first - leave plural 
> management to the authors. I know that I  am using blog & blogs 
> seperately, althouhg maybe I should have chosen blogging & blog to be 
> clearer, but the point is that tag authors are not going to think very 
> hard about their tags, this is why the del.icio.us tag prompt scheme is 
> so good. Delicious manages to ignore capitalisation, which I think is 
> good. Its certainly best for query and aggregation for popularity 
> queries., I can't believe that anyone wants to know that both "internet" 
> & "Internet" are popular tags (technorati!)
> 
> This implies that we might have two fields to document a tag,
> 
> user_tag_name            char(32)
> normal_tag_name        char(32)  is proper(user_tag_name)

I agree that plurals should be managed by users and tags should be 
case-insensitive.  My approach to this would be a bit more simplified though, I 
would prefer to simply have the software convert all tags to lowercase after 
user input.  I suppose some people could get a little upset by that, but it 
seems so much easier.  

Also, is that "proper()" function part of ANSI sql?  Will it work on all 
databases or have an equivalent on all databases?  I am also confused on how 
that would work when you are trying to determine how many times the tag "java" 
is applied.  How would you do that?

> 
> (sorry for the pseudo-code, but normal* needs to be enforced with a 
> constraint or trigger.)
> 
> This then leaves us with the question of indices (indexes?)
> 
> The implications of my data model is that the entry_author key should be 
> copied down through the entry tabel into the entry-tags table.
> 
> author_name            char(32) is (part of) FK to  entry
> entry_name              char(128) is (part of) FK to entry
> 
> although, a full roller URL contains the author name, but I assume that 
> the entryname consists of the string used in the entry & anchor queries 
> so I have a table that looks like this (again pseudo code and I have 
> guessed at the database column types).
> 
> create definedtags as (
> author_name                    char(32)
> entry_name (or id?) ,       char(128)
> user_tag_name,               char(64)
> normal_tag_name,           char(64)
> entry_date                         datetime
> date_created                    datetime )

i think this is still pretty similar to what we have now correct?  the main 
difference being that you are tracking which author wrote the specified tag, 
along with what date the tagged entry was published.  i like the idea of having 
the FK to the author included because that would allow us to easily lookup all 
tags used by a specific author.

is there a reason to copy down the entry_date rather than access it via a join 
on the weblogentry table?  you can join with the weblogentry table using the 
entry_id column.  how were you planning to use the entry_date field?

what acts as the primary key?  (author_name, entry_id, user_tag_name)?  we may 
still need to use a surrogate key to uniquely identify the row to avoid having 
a multi column primary key.  i'm not sure that would be much of an issue though 
because it doesn't look like you are planning for any joins for the tag names, 
correct?

> 
> create constraint defined_tags_name as
>     normal_tag_name=proper(user_tag_name)
> 
> create constraint defined_tags_entry_date as
>     entry_date = select entry_date from entry
>                            where definedtags.entry_name = entry.entry_name
> 
> and the following indexes
> 
> create dt.tags on definedtags
>     as author_name, entry_name, normal_tag_name unique
> 
> create dt.tags2 on definedtags
>     as normal_tag_name
> 
> create dt.entries on definedtags
>     as author_name, entry_name
> 
> create dt.date_written
>     on definedtags as entry_date.
> 
> I have written this in a hurry so it may not be though out as well as 
> some of my writing, but hopefully this is collaborative development. 
> Also I have difficulty in commenting on and reading some of the 
> front-end & java orientated stuff. (I have ordered a couple of books to 
> help me catchup). Hopefully this is helpfull
> 

I think this all makes sense to me so far and I certainly appreciate the help.  
I think getting the data model correct is a *very* important issue before we 
move forward with implemenatation, so I'm glad we are having this discussion.

-- Allen

> 
> 
> Elias Torres wrote:
> 
> >Welcome David to the Roller list.
> >
> >Thank you for your post. I have read your blog post on a tag data
> >model for Roller. I'm looking forward to your relational algebra and
> >query cost analysis. I wanted to tell you that we (IBM) have already
> >added basic tagging support to Roller and it actually supports a
> >TagCloud. I am supposed to put a proposal in the roller wiki so others
> >could comment and once I do that, you could put your comments there as
> >well.
> >
> >Just to kickstart the conversation I'm including the tagging table we
> >are currently using.
> >
> >create table weblogentrytag (
> >    id              varchar(48)   not null primary key,
> >    entryid        varchar(48)   not null,
> >    name            varchar(255)  not null,
> >    tagtime         timestamp     not null
> >);
> >
> >We have basically two tables: entries and entry2tags, but are missing
> >a tag table. At first, I was very set on having a tag table and use a
> >foreign key to "save" space on repetitive tag names. But I was shown
> >it's not really a big space saving technique, especially since tag
> >names are relatively short storing a guid or int would almost be
> >comparable in space. There are also increased costs in inserting and
> >joining on tables to get tag names if using a foreign key, so we have
> >settle on this for now until we have other queries requirements. I'll
> >be summarizing all of our changes to roller to support tagging in a
> >wiki proposal soon.
> >
> >Regarding the use of the list, some people have been using nabble.com
> >to interact with it. Maybe you can give it a try. I simply use gmail.
> >
> >http://www.nabble.com/Roller-f12275.html
> >
> >Regards,
> >
> >Elias
> >
> >On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote:
> >  
> >
> >>I have documented a data model for tags. This is held at my blog
> >>
> >>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
> >>
> >>I have a graphic demonstrating the relationship between authors,
> >>articles and tags and illustrating the first and obvious indexes. (I
> >>have identified that both "Date Published" and tag aggregates are
> >>missing from the model).  Since the model was built to help me
> >>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> >>but hopefully its simple to see that these are pretty synonomous to
> >>authors, articles and tags.
> >>
> >>I hope that this is useful for those looking at implementing tags.
> >>
> >>I am still working out how to use the mail-list, so I hope that x-refing
> >>you to my blog isn't deprecated. I also need to work how to maintain
> >>thread connections i.e. undertake a reply.
> >>--
> >>
> >>Dave
> >>
> >><http://www.sun.com>    * David Levy *
> >>*Sun Microsystems Ltd.*
> >>55, King William St.,
> >>London EC4R 9ND United Kingdom
> >>
> >>Phone +44 (0) 20 7469 9908/x18308
> >>Mobile +44 (0) 7710-360922
> >>Blog http://blogs.sun.com/DaveLevy
> >>Email [EMAIL PROTECTED]
> >>
> >>Sun Proprietary & Confidential . This e-mail message is for the sole use
> >>of the intended recipient(s) and may contain confidential and
> >>privilidged information. Any unauthorised review, use, disclosure or
> >>distribution is prohibited. If you are not the intended recepient,
> >>please contact the sender by reply e-mail and destroy all copies of the
> >>original message.
> >>
> >>
> >>    
> >>
> 
> -- 
> 
> Dave
> 
> <http://www.sun.com>  * David Levy *
> *Sun Microsystems Ltd.*
> 55, King William St.,
> London EC4R 9ND United Kingdom
> 
> Phone +44 (0) 20 7469 9908/x18308
> Mobile +44 (0) 7710-360922
> Blog http://blogs.sun.com/DaveLevy
> Email [EMAIL PROTECTED]
> 
> Sun Proprietary & Confidential . This e-mail message is for the sole use 
> of the intended recipient(s) and may contain confidential and 
> privilidged information. Any unauthorised review, use, disclosure or 
> distribution is prohibited. If you are not the intended recepient, 
> please contact the sender by reply e-mail and destroy all copies of the 
> original message.
>

Re: About Tags, a proposed data-model

Reply via email to