Re: About Tags, a proposed data-model

John Hoffmann Mon, 30 Jan 2006 07:08:56 -0800

I'd just like to add that performing joins in sql is not something to beavoided, the impact can be almost completely mitigated by caching theresults. The only real cause for concern is for truly massive datasetsin which the join cannot be performed in the amount of memory availableto the database.


-John


Allen Gilliland wrote:

I don't want to lose this thread because I think there are still some ideas to 
continue flushing out.  More comments inline ...

On Fri, 2006-01-13 at 05:14, David Levy wrote:

Sorry to have taken so long.
The denormalisations of the author_name (which may be owner name) andentrydate are to support queries. This is because I expect a macro tocreate a tag cloud for a user so that the html versions can have the tagcloud,
So

select normal_tag_name, count(*)
from entry2tags
where author_name = "DaveLevy"
group by normal_tag_name
gives us the data required for a tag cloud, for a single blog . No joinas you can see, where it gets fun is if you want a hot tags cloud
we add a line so the query becomes

select normal_tag_name, count(*)
from entry2tags
where author_name = "DaveLevy"
and     entry_date > @sevedaysago
group by normal_tag_name


very cool stuff ... i like the looks of that.

... lots of stuff chopped out here ...

is there a reason to copy down the entry_date rather than access it via a join 
on the weblogentry table?  you can join with the weblogentry table using the 
entry_id column.  how were you planning to use the entry_date field?
see above, I don't want to join to the entry table, and this goal alsoimpacts my index design. entry_date allows hot tags queries to bedriven by the entry not the tagged date


ok.  I agree that joins are a likely performance problem, but my next question then 
becomes ... How do we plan to deal with getting the data for the list of entries marked 
with a given tag?  I am expecting that when someone uses the tag dashboard or a tag cloud 
to try and view a list of entries with the tag "foo", that list will look 
something like the Roller front page.  example ...

url = /roller/tag/entries/java+netbeans

you then populate a page with 50? 25? entry summaries for people to browse 
through and those summaries will at least require the entry title and a summary 
of the entry content and may also require the entry date, category, author, and 
weblog title.  I would think we are going to require a join to get that data.

what acts as the primary key? (author_name, entry_id, user_tag_name)?
my PK is author_name, entry_id and normal_tag_name
we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key.
yeah, looks like it
i'm not sure that would be much of an issue though because it doesn't look like 
you are planning for any joins for the tag names, correct?
I'm trying to avoid any joins, but if you are looking for entries on ablog and tagged, then it would be good to enter the query onauthor_name, but since we have not copied the title down to theentry2tags table we still need the join and can go in on author_name oneither table, but best do it on entry table (see below).
select  entry.title
from    entry, entry2tags e2t
where e2t.entry_id = entry.id
and     entry.author_name = @KnownName
and     e2t.normal_tag_name in (@TagQueryList)


based on my example above, how would we get the necessary entry data when we 
don't know the author name because we are searching through the entire tag 
system, not just from a single author or weblog?

-- Allen

Right, Allen. This is very similar to what we already have. I'm not sure
why having author name here, when we already have that through a join
with the entry_id.

That's right, but I don't want to join, the entry2tags table is bigenough without joining.

I'm not sure why do we need entry_date when we have

the tagging date.

I think the queries should be driven through the entry publication date.

I do like the normal tag name. I was thinking this too for the output of the 
Porter Stemming algorithm so I wouldn't lose the original information entered 
by the user. Everything else is the same :-).

I need to read your references to understand this, but I think you agreethis is OK

Another point I want to make is the fact, that we do a little bit of
extra processing when saving an existing entry that make sure it keeps
the original date for each tag. For example, when I first created an
entry I tagged it A,B,C. The first time I edited, I removed B. The A and
C tags will retain the dates when they were added as opposed to the edit
date.

Are you holding entry_tag data on the entry table?

Additionally, we have a question of what to do with spaces. Should tags
be multi-word or not? My suggestion to Phay (one of our developers) was
to use spaces as separators in the input field, therefore not supporting
spaces. But we could do multiple things to support spaces, such as
quoting multi-word tags. I believe Flickr supports multi-words but they
remove the spaces from the tags, but technorati does maintain spaces. I
don't like them myself, because I think it fragments the tag space much
more than single words and you could still use intersections to get the
sort of the same result.

create constraint defined_tags_name as
  normal_tag_name=proper(user_tag_name)

create constraint defined_tags_entry_date as
  entry_date = select entry_date from entry
                         where definedtags.entry_name = entry.entry_name

and the following indexes

create dt.tags on definedtags
  as author_name, entry_name, normal_tag_name unique

This is the real primary key

create dt.tags2 on definedtags
  as normal_tag_name

This is the tag entity (or operational master)

create dt.entries on definedtags
  as author_name, entry_name

create dt.date_written
  on definedtags as entry_date.

I have written this in a hurry so it may not be though out as well as
some of my writing, but hopefully this is collaborative development.
Also I have difficulty in commenting on and reading some of the
front-end & java orientated stuff. (I have ordered a couple of books to
help me catchup). Hopefully this is helpfull

I think this all makes sense to me so far and I certainly appreciate the help.  
I think getting the data model correct is a *very* important issue before we 
move forward with implemenatation, so I'm glad we are having this discussion.

-- Allen

[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
[2] http://torrez.us/archives/2005/07/13/tagrank.pdf

Elias Torres wrote:

Welcome David to the Roller list.

Thank you for your post. I have read your blog post on a tag data
model for Roller. I'm looking forward to your relational algebra and
query cost analysis. I wanted to tell you that we (IBM) have already
added basic tagging support to Roller and it actually supports a
TagCloud. I am supposed to put a proposal in the roller wiki so others
could comment and once I do that, you could put your comments there as
well.

Just to kickstart the conversation I'm including the tagging table we
are currently using.

create table weblogentrytag (
 id              varchar(48)   not null primary key,
 entryid        varchar(48)   not null,
 name            varchar(255)  not null,
 tagtime         timestamp     not null
);

We have basically two tables: entries and entry2tags, but are missing
a tag table. At first, I was very set on having a tag table and use a
foreign key to "save" space on repetitive tag names. But I was shown
it's not really a big space saving technique, especially since tag
names are relatively short storing a guid or int would almost be
comparable in space. There are also increased costs in inserting and
joining on tables to get tag names if using a foreign key, so we have
settle on this for now until we have other queries requirements. I'll
be summarizing all of our changes to roller to support tagging in a
wiki proposal soon.

Regarding the use of the list, some people have been using nabble.com
to interact with it. Maybe you can give it a try. I simply use gmail.

http://www.nabble.com/Roller-f12275.html

Regards,

Elias

On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote:

I have documented a data model for tags. This is held at my blog

http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database

I have a graphic demonstrating the relationship between authors,
articles and tags and illustrating the first and obvious indexes. (I
have identified that both "Date Published" and tag aggregates are
missing from the model).  Since the model was built to help me
understand del.icio.us, I call the entities Users, Bookmarks and Tags,
but hopefully its simple to see that these are pretty synonomous to
authors, articles and tags.

I hope that this is useful for those looking at implementing tags.

I am still working out how to use the mail-list, so I hope that x-refing you to 
my blog isn't deprecated. I also need to work how to maintain thread 
connections i.e. undertake a reply.
--

Dave

--

Dave

--

Dave

<http://www.sun.com>      * David Levy *






Blog http://blogs.sun.com/DaveLevy
Email [EMAIL PROTECTED]

Sun Proprietary & Confidential . This e-mail message is for the sole useof the intended recipient(s) and may contain confidential andprivilidged information. Any unauthorised review, use, disclosure ordistribution is prohibited. If you are not the intended recepient,please contact the sender by reply e-mail and destroy all copies of theoriginal message.

Re: About Tags, a proposed data-model

Reply via email to