Re: About Tags, a proposed data-model

Elias Torres Fri, 06 Jan 2006 11:17:53 -0800

On 1/6/06, Allen Gilliland <[EMAIL PROTECTED]> wrote:
> On Fri, 2006-01-06 at 07:05, David Levy wrote:
> > This has been written after the proposal has been updated and the table
> > defintion documented there. i.e
> >
> > http://rollerweblogger.org/wiki/Wiki.jsp?page=Proposal_WeblogTags
> >
> > But I have decided to extent this thread; it pertains to the database data 
> > model.
> >
> > I have some questions.
> >
> > is id a surrogate (sequential or hashed) key
> > what is entryid? i.e. the article URL? i.e. foreign key to entries
> > what is name? (The value of the tag?) i.e. foreign key to Tags table
>
> Anil answered these in his email.
>
> >
> > Do I need the definitions of the entries and entries2tags to understand 
> > what we need to do?
> >
> > I have come to the conclusion that using surrogate keys is(nearly
> > always) a mistake. The space management advantages are normally negated
> > by the requirement to join against the tables holding the real keys. You
> > normally require an additonal unique index to enforce the uniqueness
> > constraints against both the surrogate & real keys, and will often need
> > to de-optimise the TNF structure to replicate the real data around the
> > database to avoid otherwise unnecessary joins. Maintaining these
> > de-optimisations have both human and perfromance costs.
>
> I am often of the same opinion, but part of the problem may be that Hibernate 
> (our ORM persistence tool) suggests using surrogate keys and probably doesn't 
> like multi column primary keys.
>


I agree with David. If you see that's exactly why we didn't normalize
our table to create a tag table. We are instead just duplicating the
actual tag because a surrogate key would take up even much more space
than most of the tags. BTW, this is working fine with Hibernate so far,
although, I believe some of the queries were written using HQL (???)
because of their complexity in grouping, etc.


> >
> > If you agree that real keys should be used then we need to consider
> > capitalisation & plurals of tags. Easy one first - leave plural
> > management to the authors. I know that I  am using blog & blogs
> > seperately, althouhg maybe I should have chosen blogging & blog to be
> > clearer, but the point is that tag authors are not going to think very
> > hard about their tags, this is why the del.icio.us tag prompt scheme is
> > so good. Delicious manages to ignore capitalisation, which I think is
> > good. Its certainly best for query and aggregation for popularity
> > queries., I can't believe that anyone wants to know that both "internet"
> > & "Internet" are popular tags (technorati!)
> >
> > This implies that we might have two fields to document a tag,
> >
> > user_tag_name            char(32)
> > normal_tag_name        char(32)  is proper(user_tag_name)
>
> I agree that plurals should be managed by users and tags should be 
> case-insensitive.  My approach to this would be a bit more simplified though, 
> I would prefer to simply have the software convert all tags to lowercase 
> after user input.  I suppose some people could get a little upset by that, 
> but it seems so much easier.
>
> Also, is that "proper()" function part of ANSI sql?  Will it work on all 
> databases or have an equivalent on all databases?  I am also confused on how 
> that would work when you are trying to determine how many times the tag 
> "java" is applied.  How would you do that?
>

Our current implementation performs a lowercase on all tags (I took that
from del.icio.us. But I'm torn on whether or not to let users manage
plurals. I'll be adding to our internal server a filter using the Porter
Stemming [1] algorithm to see if its results are desirable. I'll report
on the experiment as we get some results.

I'm not necessarily convinced that the prompt scheme is all that
beneficial. There are some possible consequences that I have studied a
little bit on my TagRank [2] paper. The problem is that care must be
taken to "suggest" tags because if not, the users will reach a local
maximum and never allowing other tags to reach popular/meaningful
levels in the network.

> >
> > (sorry for the pseudo-code, but normal* needs to be enforced with a
> > constraint or trigger.)
> >
> > This then leaves us with the question of indices (indexes?)
> >
> > The implications of my data model is that the entry_author key should be
> > copied down through the entry tabel into the entry-tags table.
> >
> > author_name            char(32) is (part of) FK to  entry
> > entry_name              char(128) is (part of) FK to entry
> >
> > although, a full roller URL contains the author name, but I assume that
> > the entryname consists of the string used in the entry & anchor queries
> > so I have a table that looks like this (again pseudo code and I have
> > guessed at the database column types).
> >
> > create definedtags as (
> > author_name                    char(32)
> > entry_name (or id?) ,       char(128)
> > user_tag_name,               char(64)
> > normal_tag_name,           char(64)
> > entry_date                         datetime
> > date_created                    datetime )
>
> i think this is still pretty similar to what we have now correct?  the main 
> difference being that you are tracking which author wrote the specified tag, 
> along with what date the tagged entry was published.  i like the idea of 
> having the FK to the author included because that would allow us to easily 
> lookup all tags used by a specific author.
>
> is there a reason to copy down the entry_date rather than access it via a 
> join on the weblogentry table?  you can join with the weblogentry table using 
> the entry_id column.  how were you planning to use the entry_date field?
>
> what acts as the primary key?  (author_name, entry_id, user_tag_name)?  we 
> may still need to use a surrogate key to uniquely identify the row to avoid 
> having a multi column primary key.  i'm not sure that would be much of an 
> issue though because it doesn't look like you are planning for any joins for 
> the tag names, correct?
>

Right, Allen. This is very similar to what we already have. I'm not sure
why having author name here, when we already have that through a join
with the entry_id. I'm not sure why do we need entry_date when we have
the tagging date. I do like the normal tag name. I was thinking this too
for the output of the Porter Stemming algorithm so I wouldn't lose the
original information entered by the user. Everything else is the same :-).

Another point I want to make is the fact, that we do a little bit of
extra processing when saving an existing entry that make sure it keeps
the original date for each tag. For example, when I first created an
entry I tagged it A,B,C. The first time I edited, I removed B. The A and
C tags will retain the dates when they were added as opposed to the edit
date.

Additionally, we have a question of what to do with spaces. Should tags
be multi-word or not? My suggestion to Phay (one of our developers) was
to use spaces as separators in the input field, therefore not supporting
spaces. But we could do multiple things to support spaces, such as
quoting multi-word tags. I believe Flickr supports multi-words but they
remove the spaces from the tags, but technorati does maintain spaces. I
don't like them myself, because I think it fragments the tag space much
more than single words and you could still use intersections to get the
sort of the same result.


> >
> > create constraint defined_tags_name as
> >     normal_tag_name=proper(user_tag_name)
> >
> > create constraint defined_tags_entry_date as
> >     entry_date = select entry_date from entry
> >                            where definedtags.entry_name = entry.entry_name
> >
> > and the following indexes
> >
> > create dt.tags on definedtags
> >     as author_name, entry_name, normal_tag_name unique
> >
> > create dt.tags2 on definedtags
> >     as normal_tag_name
> >
> > create dt.entries on definedtags
> >     as author_name, entry_name
> >
> > create dt.date_written
> >     on definedtags as entry_date.
> >
> > I have written this in a hurry so it may not be though out as well as
> > some of my writing, but hopefully this is collaborative development.
> > Also I have difficulty in commenting on and reading some of the
> > front-end & java orientated stuff. (I have ordered a couple of books to
> > help me catchup). Hopefully this is helpfull
> >
>
> I think this all makes sense to me so far and I certainly appreciate the 
> help.  I think getting the data model correct is a *very* important issue 
> before we move forward with implemenatation, so I'm glad we are having this 
> discussion.
>
> -- Allen

[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
[2] http://torrez.us/archives/2005/07/13/tagrank.pdf

>
> >
> >
> > Elias Torres wrote:
> >
> > >Welcome David to the Roller list.
> > >
> > >Thank you for your post. I have read your blog post on a tag data
> > >model for Roller. I'm looking forward to your relational algebra and
> > >query cost analysis. I wanted to tell you that we (IBM) have already
> > >added basic tagging support to Roller and it actually supports a
> > >TagCloud. I am supposed to put a proposal in the roller wiki so others
> > >could comment and once I do that, you could put your comments there as
> > >well.
> > >
> > >Just to kickstart the conversation I'm including the tagging table we
> > >are currently using.
> > >
> > >create table weblogentrytag (
> > >    id              varchar(48)   not null primary key,
> > >    entryid        varchar(48)   not null,
> > >    name            varchar(255)  not null,
> > >    tagtime         timestamp     not null
> > >);
> > >
> > >We have basically two tables: entries and entry2tags, but are missing
> > >a tag table. At first, I was very set on having a tag table and use a
> > >foreign key to "save" space on repetitive tag names. But I was shown
> > >it's not really a big space saving technique, especially since tag
> > >names are relatively short storing a guid or int would almost be
> > >comparable in space. There are also increased costs in inserting and
> > >joining on tables to get tag names if using a foreign key, so we have
> > >settle on this for now until we have other queries requirements. I'll
> > >be summarizing all of our changes to roller to support tagging in a
> > >wiki proposal soon.
> > >
> > >Regarding the use of the list, some people have been using nabble.com
> > >to interact with it. Maybe you can give it a try. I simply use gmail.
> > >
> > >http://www.nabble.com/Roller-f12275.html
> > >
> > >Regards,
> > >
> > >Elias
> > >
> > >On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > >>I have documented a data model for tags. This is held at my blog
> > >>
> > >>http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
> > >>
> > >>I have a graphic demonstrating the relationship between authors,
> > >>articles and tags and illustrating the first and obvious indexes. (I
> > >>have identified that both "Date Published" and tag aggregates are
> > >>missing from the model).  Since the model was built to help me
> > >>understand del.icio.us, I call the entities Users, Bookmarks and Tags,
> > >>but hopefully its simple to see that these are pretty synonomous to
> > >>authors, articles and tags.
> > >>
> > >>I hope that this is useful for those looking at implementing tags.
> > >>
> > >>I am still working out how to use the mail-list, so I hope that x-refing
> > >>you to my blog isn't deprecated. I also need to work how to maintain
> > >>thread connections i.e. undertake a reply.
> > >>--
> > >>
> > >>Dave
> > >>
> > >><http://www.sun.com>    * David Levy *
> > >>*Sun Microsystems Ltd.*
> > >>55, King William St.,
> > >>London EC4R 9ND United Kingdom
> > >>
> > >>Phone +44 (0) 20 7469 9908/x18308
> > >>Mobile +44 (0) 7710-360922
> > >>Blog http://blogs.sun.com/DaveLevy
> > >>Email [EMAIL PROTECTED]
> > >>
> > >>Sun Proprietary & Confidential . This e-mail message is for the sole use
> > >>of the intended recipient(s) and may contain confidential and
> > >>privilidged information. Any unauthorised review, use, disclosure or
> > >>distribution is prohibited. If you are not the intended recepient,
> > >>please contact the sender by reply e-mail and destroy all copies of the
> > >>original message.
> > >>
> > >>
> > >>
> > >>
> >
> > --
> >
> > Dave
> >
> > <http://www.sun.com>  * David Levy *
> > *Sun Microsystems Ltd.*
> > 55, King William St.,
> > London EC4R 9ND United Kingdom
> >
> > Phone +44 (0) 20 7469 9908/x18308
> > Mobile +44 (0) 7710-360922
> > Blog http://blogs.sun.com/DaveLevy
> > Email [EMAIL PROTECTED]
> >
> > Sun Proprietary & Confidential . This e-mail message is for the sole use
> > of the intended recipient(s) and may contain confidential and
> > privilidged information. Any unauthorised review, use, disclosure or
> > distribution is prohibited. If you are not the intended recepient,
> > please contact the sender by reply e-mail and destroy all copies of the
> > original message.
> >
>
>

Re: About Tags, a proposed data-model

Reply via email to