Re: About Tags, a proposed data-model

Dave Johnson Mon, 30 Jan 2006 11:43:34 -0800


On Jan 30, 2006, at 1:45 PM, Allen Gilliland wrote:

I agree that is true *if* results are cached, but therein lies the problem.
I have spent quite a bit of time working on caching and performance in Roller and with our current setup it's not caching that is hard, it's having a big enough cache that's hard. The fact is that a blog takes up a lot of space because you have to consider caching entries, comments, bookmarks & folders, categories, referers, and templates. As a blog grows so do most of those things, especially the entries & comments. On top of just caching those objects we currently cache fully rendered pages and feeds, so that means a handful of xml feeds and quite a few full html pages. The point being, on a large site there is tons of data that needs caching already without having to cache tag related data.
Currently, I have no idea how we can expect to cache all the data that would be needed for a full tagging system along with everything else we cache right now.



Here are my thoughts on tags and tag search vs. caching

*** We can't cache everything

For example, if we allow people to perform arbitrary tag queries
via the Roller UI, we're not going to be able to cache the results.
In that case, we're probably OK. After all, how many people
are going to be doing tag searches on a Roller site simultaneously?

*** Tag based newsfeeds are where the problem arises

The problem arises when we start to allow people to subscribe
to tag searches. In that case, we'll have newsfeed readers and
aggregators hammering away hourly night and day. What's
worse, the number of feeds will go to infinity.

And we already have too many feeds:

total feeds = (number of blogs) X (cats per blog) X (2 feed types, Atom and RSS)
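To put rough numbers on the formula above, here is a small sketch. The site sizes are made up, and the tag-feed count assumes (my assumption, not something stated in the thread) that any combination of up to N tags can be subscribed to as its own feed.

```python
from math import comb


def category_feed_count(num_blogs, cats_per_blog, feed_types=2):
    """Feeds per the formula above: blogs x categories per blog x feed types."""
    return num_blogs * cats_per_blog * feed_types


def tag_feed_count(distinct_tags, max_tags_per_query, feed_types=2):
    """Feeds if every combination of 1..max_tags_per_query tags is subscribable.

    This combination model is a hypothetical illustration of why tag feeds
    grow so much faster than category feeds.
    """
    combos = sum(comb(distinct_tags, k) for k in range(1, max_tags_per_query + 1))
    return combos * feed_types


# Made-up site: 1,000 blogs, 10 categories each, 500 distinct tags.
category_feeds = category_feed_count(1000, 10)
tag_feeds = tag_feed_count(500, 2)
print(category_feeds, tag_feeds)
```

Even when queries are limited to two tags, the possible tag feeds on this imaginary site outnumber the category feeds by more than ten to one, which is the growth the "go to infinity" remark is pointing at.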


So...

*** We should allow admins to disable tag based newsfeeds

We should allow arbitrary tag searches and getting the results as
newsfeeds, but we should also make it possible to turn both of those
off via Roller properties or the UI.

*** We should allow admins to define a finite list of site-wide feeds

What I'd like to do with Proposal Atlas is provide a way for a site
administrator to decide what feeds are to be displayed on the front
page of a Roller site and define a finite list of feeds to be provided
based on aggregations, tags, and internal objects (e.g. new user
and new blog feeds). I'll have a new rev. of the proposal ready
in the next day or two.

- Dave

On Mon, 2006-01-30 at 07:07, John Hoffmann wrote:
I'd just like to add that performing joins in sql is not something to be avoided; the impact can be almost completely mitigated by caching the results. The only real cause for concern is for truly massive datasets in which the join cannot be performed in the amount of memory available to the database.

-John

Allen Gilliland wrote:
I don't want to lose this thread because I think there are still some ideas to continue fleshing out. More comments inline ...
On Fri, 2006-01-13 at 05:14, David Levy wrote:
Sorry to have taken so long.
The denormalisations of the author_name (which may be ownername) and entrydate are to support queries. This is because I expect a macro to create a tag cloud for a user so that the html versions can have the tag cloud.

So

select normal_tag_name, count(*)
from entry2tags
where author_name = 'DaveLevy'
group by normal_tag_name
gives us the data required for a tag cloud for a single blog. No join,
as you can see. Where it gets fun is if you want a hot tags cloud.

We add a line so the query becomes

select normal_tag_name, count(*)
from entry2tags
where author_name = 'DaveLevy'
and     entry_date > @sevendaysago
group by normal_tag_name
very cool stuff ... i like the looks of that.
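A minimal runnable sketch of the two queries above, using an in-memory SQLite database. The table layout follows the entry2tags columns discussed in this thread; the column types, the sample rows, the ISO-string dates, and the helper name tag_cloud are assumptions made only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table entry2tags (
        author_name     text not null,
        entry_id        text not null,
        normal_tag_name text not null,
        entry_date      text not null,  -- ISO date copied down from the entry
        primary key (author_name, entry_id, normal_tag_name)
    )
""")
conn.executemany(
    "insert into entry2tags values (?, ?, ?, ?)",
    [
        ("DaveLevy", "e1", "java", "2006-01-05"),
        ("DaveLevy", "e1", "roller", "2006-01-05"),
        ("DaveLevy", "e2", "java", "2006-01-28"),
        ("DaveLevy", "e3", "tags", "2006-01-29"),
        ("Elias", "e4", "java", "2006-01-29"),
    ],
)


def tag_cloud(conn, author, since=None):
    """Tag counts for one author; pass `since` to get the 'hot tags' variant."""
    sql = "select normal_tag_name, count(*) from entry2tags where author_name = ?"
    params = [author]
    if since is not None:
        sql += " and entry_date > ?"  # the extra line from the hot tags query
        params.append(since)
    sql += " group by normal_tag_name order by normal_tag_name"
    return conn.execute(sql, params).fetchall()


cloud = tag_cloud(conn, "DaveLevy")
hot = tag_cloud(conn, "DaveLevy", since="2006-01-23")
print(cloud, hot)
```

Both variants touch only entry2tags, which is the point of copying author_name and entry_date down into that table.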

... lots of stuff chopped out here ...
is there a reason to copy down the entry_date rather than access it via a join on the weblogentry table? you can join with the weblogentry table using the entry_id column. how were you planning to use the entry_date field?
see above, I don't want to join to the entry table, and this goal also
impacts my index design.  entry_date allows hot tags queries to be
driven by the entry not the tagged date
ok. I agree that joins are a likely performance problem, but my next question then becomes ... How do we plan to deal with getting the data for the list of entries marked with a given tag? I am expecting that when someone uses the tag dashboard or a tag cloud to try and view a list of entries with the tag "foo", that list will look something like the Roller front page. example ...
url = /roller/tag/entries/java+netbeans
you then populate a page with 50? 25? entry summaries for people to browse through and those summaries will at least require the entry title and a summary of the entry content and may also require the entry date, category, author, and weblog title. I would think we are going to require a join to get that data.
what acts as the primary key? (author_name, entry_id, user_tag_name)?
my PK is author_name, entry_id and normal_tag_name
we may still need to use a surrogate key to uniquely identify the row to avoid having a multi column primary key.
yeah, looks like it
i'm not sure that would be much of an issue though because it doesn't look like you are planning for any joins for the tag names, correct?
I'm trying to avoid any joins, but if you are looking for entries on a
blog and tagged, then it would be good to enter the query on
author_name, but since we have not copied the title down to the
entry2tags table we still need the join and can go in on author_name on
either table, but best do it on entry table (see below).

select  entry.title
from    entry, entry2tags e2t
where e2t.entry_id = entry.id
and     entry.author_name = @KnownName
and     e2t.normal_tag_name in (@TagQueryList)
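A sketch of how the join above might be run with a variable-length tag list, again against in-memory SQLite. The table and column names come from the query; the sample data, the helper name entries_for_tags, and the added "distinct" (so an entry matching two of the tags is listed once) are my assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table entry (id text primary key, author_name text, title text);
    create table entry2tags (
        author_name text, entry_id text, normal_tag_name text,
        primary key (author_name, entry_id, normal_tag_name)
    );
    insert into entry values ('e1', 'DaveLevy', 'Tagging in Roller');
    insert into entry values ('e2', 'DaveLevy', 'NetBeans tips');
    insert into entry values ('e3', 'Elias', 'Tag clouds');
    insert into entry2tags values ('DaveLevy', 'e1', 'java');
    insert into entry2tags values ('DaveLevy', 'e1', 'roller');
    insert into entry2tags values ('DaveLevy', 'e2', 'java');
    insert into entry2tags values ('DaveLevy', 'e2', 'netbeans');
    insert into entry2tags values ('Elias', 'e3', 'java');
""")


def entries_for_tags(conn, author, tags):
    """Titles of one author's entries carrying any of the given tags."""
    if not tags:
        return []
    # One bound parameter per tag stands in for @TagQueryList.
    placeholders = ", ".join("?" for _ in tags)
    sql = f"""
        select distinct entry.title
        from entry, entry2tags e2t
        where e2t.entry_id = entry.id
        and entry.author_name = ?
        and e2t.normal_tag_name in ({placeholders})
        order by entry.title
    """
    return [row[0] for row in conn.execute(sql, [author, *tags])]


titles = entries_for_tags(conn, "DaveLevy", ["java", "netbeans"])
print(titles)
```

Binding the tags as parameters rather than pasting them into the SQL text keeps user-entered tag names from being interpreted as SQL.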
based on my example above, how would we get the necessary entry data when we don't know the author name because we are searching through the entire tag system, not just from a single author or weblog?
-- Allen
Right, Allen. This is very similar to what we already have. I'm not sure why we have author name here, when we already have that through a join with the entry_id.
That's right, but I don't want to join, the entry2tags table is big
enough without joining.
I'm not sure why we need entry_date when we have
the tagging date.
I think the queries should be driven through the entry publication date.
I do like the normal tag name. I was thinking this too for the output of the Porter Stemming algorithm so I wouldn't lose the original information entered by the user. Everything else is the same :-).
I need to read your references to understand this, but I think you agree
this is OK
Another point I want to make is the fact that we do a little bit of extra processing when saving an existing entry that makes sure it keeps the original date for each tag. For example, when I first created an entry I tagged it A,B,C. The first time I edited, I removed B. The A and C tags will retain the dates when they were added as opposed to the edit date.
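The date-preserving save described above could look roughly like this. It is only a sketch of the idea, not the code actually used; the function name merge_tag_dates and the dict-of-dates representation are assumptions.

```python
from datetime import datetime


def merge_tag_dates(existing, edited_tags, now):
    """Return tag -> tagging date for an edited entry.

    Tags that were already on the entry keep their original date, tags
    that were removed are dropped, and newly added tags get `now`.
    """
    return {tag: existing.get(tag, now) for tag in edited_tags}


created = datetime(2006, 1, 1, 9, 0)
edited = datetime(2006, 1, 30, 11, 43)

# Entry first tagged A, B, C; the edit removes B and adds D.
before = {"A": created, "B": created, "C": created}
after = merge_tag_dates(before, ["A", "C", "D"], now=edited)
print(after)
```

Here A and C keep the creation date, B disappears, and only D picks up the edit date.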
Are you holding entry_tag data on the entry table?
Additionally, we have a question of what to do with spaces. Should tags be multi-word or not? My suggestion to Phay (one of our developers) was to use spaces as separators in the input field, therefore not supporting spaces. But we could do multiple things to support spaces, such as quoting multi-word tags. I believe Flickr supports multi-words but they remove the spaces from the tags, but technorati does maintain spaces. I don't like them myself, because I think it fragments the tag space much more than single words and you could still use intersections to get sort of the same result.
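To make the options above concrete, here is a small sketch of an input parser that treats spaces as separators but accepts quoted multi-word tags, with a switch for the Flickr-style behaviour of removing the spaces. The function name parse_tags and the keep_spaces flag are hypothetical. It leans on Python's shlex.split, which also treats apostrophes as quotes, so a tag like don't would need extra handling.

```python
import shlex


def parse_tags(raw, keep_spaces=True):
    """Split a tag input field into a de-duplicated list of tags.

    Spaces separate tags; double quotes group a multi-word tag.
    With keep_spaces=False the words of a quoted tag are run together.
    """
    tags = []
    for token in shlex.split(raw):
        tag = " ".join(token.split())  # collapse runs of whitespace
        if not keep_spaces:
            tag = tag.replace(" ", "")
        if tag and tag not in tags:
            tags.append(tag)
    return tags


field = 'java "Open Source" roller java'
with_spaces = parse_tags(field)
without_spaces = parse_tags(field, keep_spaces=False)
print(with_spaces, without_spaces)
```

An unbalanced quote makes shlex.split raise ValueError, so a real input form would have to catch that and report it to the user.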
create constraint defined_tags_name as
  normal_tag_name=proper(user_tag_name)

create constraint defined_tags_entry_date as
  entry_date = select entry_date from entry
  where definedtags.entry_name = entry.entry_name
and the following indexes

create dt.tags on definedtags
  as author_name, entry_name, normal_tag_name unique
This is the real primary key
create dt.tags2 on definedtags
  as normal_tag_name
This is the tag entity (or operational master)
create dt.entries on definedtags
  as author_name, entry_name

create dt.date_written
  on definedtags as entry_date.
I have written this in a hurry so it may not be thought out as well as some of my writing, but hopefully this is collaborative development.
Also I have difficulty in commenting on and reading some of the
front-end & java orientated stuff. (I have ordered a couple of books to
help me catch up). Hopefully this is helpful
I think this all makes sense to me so far and I certainly appreciate the help. I think getting the data model correct is a *very* important issue before we move forward with implementation, so I'm glad we are having this discussion.
-- Allen
[1] http://www.tartarus.org/~martin/PorterStemmer/def.txt
[2] http://torrez.us/archives/2005/07/13/tagrank.pdf
Elias Torres wrote:
Welcome David to the Roller list.
Thank you for your post. I have read your blog post on a tag data model for Roller. I'm looking forward to your relational algebra and query cost analysis. I wanted to tell you that we (IBM) have already added basic tagging support to Roller and it actually supports a TagCloud. I am supposed to put a proposal in the roller wiki so others could comment and once I do that, you could put your comments there as well.
Just to kickstart the conversation I'm including the tagging table we
are currently using.

create table weblogentrytag (
 id              varchar(48)   not null primary key,
 entryid        varchar(48)   not null,
 name            varchar(255)  not null,
 tagtime         timestamp     not null
);
We have basically two tables: entries and entry2tags, but are missing a tag table. At first, I was very set on having a tag table and using a foreign key to "save" space on repetitive tag names. But I was shown it's not really a big space saving technique, especially since tag names are relatively short, so storing a guid or int would almost be comparable in space. There are also increased costs in inserting and joining on tables to get tag names if using a foreign key, so we have settled on this for now until we have other query requirements. I'll be summarizing all of our changes to roller to support tagging in a wiki proposal soon.
Regarding the use of the list, some people have been using nabble.com to interact with it. Maybe you can give it a try. I simply use gmail.
http://www.nabble.com/Roller-f12275.html

Regards,

Elias

On 1/4/06, David Levy <[EMAIL PROTECTED]> wrote:
I have documented a data model for tags. This is held at my blog
http://blogs.sun.com/roller/page/DaveLevy?entry=implementing_tags_in_a_database
I have a graphic demonstrating the relationship between authors, articles and tags and illustrating the first and obvious indexes. (I have identified that both "Date Published" and tag aggregates are
missing from the model).  Since the model was built to help me
understand del.icio.us, I call the entities Users, Bookmarks and Tags, but hopefully it's simple to see that these are pretty synonymous to
authors, articles and tags.
I hope that this is useful for those looking at implementing tags.
I am still working out how to use the mail-list, so I hope that x-refing you to my blog isn't deprecated. I also need to work out how to maintain thread connections i.e. undertake a reply.
--

Dave

<http://www.sun.com>      * David Levy *






Blog http://blogs.sun.com/DaveLevy
Email [EMAIL PROTECTED]
